Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data

A preliminary version of this work appeared in Eurocrypt 2004 [DRS04]. This version appears in SIAM Journal on Computing, 38(1):97–139, 2008.

Yevgeniy Dodis (dodis@cs.nyu.edu), New York University, Department of Computer Science, 251 Mercer St., New York, NY 10012 USA.

Rafail Ostrovsky (rafail@cs.ucla.edu), University of California, Los Angeles, Department of Computer Science, Box 951596, 3732D BH, Los Angeles, CA 90095 USA.

Leonid Reyzin (reyzin@cs.bu.edu), Boston University, Department of Computer Science, 111 Cummington St., Boston MA 02215 USA.

Adam Smith (asmith@cse.psu.edu), Pennsylvania State University, Department of Computer Science and Engineering, 342 IST, University Park, PA 16803 USA. The research reported here was done while the author was a student at the Computer Science and Artificial Intelligence Laboratory at MIT and a postdoctoral fellow at the Weizmann Institute of Science.
(January 20, 2008)

We provide formal definitions and efficient secure techniques for

  • turning noisy information into keys usable for any cryptographic application, and, in particular,

  • reliably and securely authenticating biometric data.

Our techniques apply not just to biometric information, but to any keying material that, unlike traditional cryptographic keys, is (1) not reproducible precisely and (2) not distributed uniformly. We propose two primitives: a fuzzy extractor reliably extracts nearly uniform randomness $R$ from its input; the extraction is error-tolerant in the sense that $R$ will be the same even if the input changes, as long as it remains reasonably close to the original. Thus, $R$ can be used as a key in a cryptographic application. A secure sketch produces public information about its input $w$ that does not reveal $w$, and yet allows exact recovery of $w$ given another value that is close to $w$. Thus, it can be used to reliably reproduce error-prone biometric inputs without incurring the security risk inherent in storing them.

We define the primitives to be both formally secure and versatile, generalizing much prior work. In addition, we provide nearly optimal constructions of both primitives for various measures of “closeness” of input data, such as Hamming distance, edit distance, and set difference.

Key words.  fuzzy extractors, fuzzy fingerprints, randomness extractors, error-correcting codes, biometric authentication, error-tolerance, nonuniformity, password-based systems, metric embeddings

AMS subject classifications. 68P25, 68P30, 68Q99, 94A17, 94A60, 94B35, 94B99

1 Introduction

Cryptography traditionally relies on uniformly distributed and precisely reproducible random strings for its secrets. Reality, however, makes it difficult to create, store, and reliably retrieve such strings. Strings that are neither uniformly random nor reliably reproducible seem to be more plentiful. For example, a random person’s fingerprint or iris scan is clearly not a uniform random string, nor does it get reproduced precisely each time it is measured. Similarly, a long pass-phrase (or answers to 15 questions [FJ01] or a list of favorite movies [JS06]) is not uniformly random and is difficult to remember for a human user. This work is about using such nonuniform and unreliable secrets in cryptographic applications. Our approach is rigorous and general, and our results have both theoretical and practical value.

To illustrate the use of random strings on a simple example, let us consider the task of password authentication. A user Alice has a password $w$ and wants to gain access to her account. A trusted server stores some information $y = f(w)$ about the password. When Alice enters $w$, the server lets Alice in only if $f(w) = y$. In this simple application, we assume that it is safe for Alice to enter the password for the verification. However, the server’s long-term storage is not assumed to be secure (e.g., $y$ is stored in a publicly readable /etc/passwd file in UNIX [MT79]). The goal, then, is to design an efficient $f$ that is hard to invert (i.e., given $y$ it is hard to find $w'$ such that $f(w') = y$), so that no one can figure out Alice’s password from $y$. Recall that such functions $f$ are called one-way functions.
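The authentication flow just described can be illustrated with a minimal sketch. It assumes SHA-256 as a heuristic stand-in for the one-way function $f$; that choice, and the names `enroll` and `verify`, are illustrative assumptions and not part of this paper.

```python
import hashlib
import hmac


def f(w: bytes) -> bytes:
    """Stand-in for a one-way function (assumption: SHA-256 is used heuristically)."""
    return hashlib.sha256(w).digest()


def enroll(w: bytes) -> bytes:
    """Server side: store y = f(w) instead of the password itself."""
    return f(w)


def verify(candidate: bytes, y: bytes) -> bool:
    """Accept only if f(candidate) equals the stored value y."""
    return hmac.compare_digest(f(candidate), y)


y = enroll(b"correct horse battery")
exact_ok = verify(b"correct horse battery", y)   # exact match is accepted
typo_ok = verify(b"correct horse batery", y)     # a single typo is rejected
```

The second check shows the limitation that motivates the rest of the paper: any deviation from the enrolled input, however small, causes rejection.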

Unfortunately, the solution above has several problems when used with passwords $w$ available in real life. First, the definition of a one-way function assumes that $w$ is truly uniform and guarantees nothing if this is not the case. However, human-generated and biometric passwords are far from uniform, although they do have some unpredictability in them. Second, Alice has to reproduce her password exactly each time she authenticates herself. This restriction severely limits the kinds of passwords that can be used. Indeed, a human can precisely memorize and reliably type in only relatively short passwords, which do not provide an adequate level of security. Greater levels of security are achieved by longer human-generated and biometric passwords, such as pass-phrases, answers to questionnaires, handwritten signatures, fingerprints, retina scans, voice commands, and other values selected by humans or provided by nature, possibly in combination (see [Fry00] for a survey). These measurements seem to contain much more entropy than human-memorizable passwords. However, two biometric readings are rarely identical, even though they are likely to be close; similarly, humans are unlikely to precisely remember their answers to multiple questions from time to time, though such answers will likely be similar. In other words, the ability to tolerate a (limited) number of errors in the password while retaining security is crucial if we are to obtain greater security than provided by typical user-chosen short passwords.

The password authentication described above is just one example of a cryptographic application where the issues of nonuniformity and error-tolerance naturally come up. Other examples include any cryptographic application, such as encryption, signatures, or identification, where the secret key comes in the form of noisy nonuniform data.

Our Definitions.  As discussed above, an important general problem is to convert noisy nonuniform inputs into reliably reproducible, uniformly random strings. To this end, we propose a new primitive, termed fuzzy extractor. It extracts a uniformly random string $R$ from its input $w$ in a noise-tolerant way. Noise-tolerance means that if the input changes to some $w'$ but remains close, the string $R$ can be reproduced exactly. To assist in reproducing $R$ from $w'$, the fuzzy extractor outputs a nonsecret string $P$. It is important to note that $R$ remains uniformly random even given $P$. (Strictly speaking, $R$ will be $\epsilon$-close to uniform rather than uniform; $\epsilon$ can be made exponentially small, which makes $R$ as good as uniform for the usual applications.)

Our approach is general: $R$ extracted from $w$ can be used as a key in a cryptographic application but, unlike traditional keys, need not be stored (because it can be recovered from any $w'$ that is close to $w$). We define fuzzy extractors to be information-theoretically secure, thus allowing them to be used in cryptographic systems without introducing additional assumptions (of course, the cryptographic application itself will typically have computational, rather than information-theoretic, security).

For a concrete example of how to use fuzzy extractors, in the password authentication case, the server can store $(P, f(R))$. When the user inputs $w'$ close to $w$, the server reproduces the actual $R$ using $P$ and checks if $f(R)$ matches what it stores. The presence of $P$ will help the adversary invert $f(R)$ only by the additive amount of $\epsilon$, because $R$ is $\epsilon$-close to uniform even given $P$. [Footnote 1: To be precise, we should note that because we do not require $w$, and hence $P$, to be efficiently samplable, we need $f$ to be a one-way function even in the presence of samples from $w$; this is implied by security against circuit families.] Similarly, $R$ can be used for symmetric encryption, for generating a public-secret key pair, or for other applications that utilize uniformly random secrets. [Footnote 2: Naturally, the security of the resulting system should be properly defined and proven and will depend on the possible adversarial attacks. In particular, in this work we do not consider active attacks on $P$ or scenarios in which the adversary can force multiple invocations of the extractor with related $w$ and gets to observe the different $P$ values. See [Boy04, BDK+05, DKRS06] for follow-up work that considers attacks on the fuzzy extractor itself.]


Figure 1: (a) secure sketch; (b) fuzzy extractor; (c) a sample application: user who encrypts a sensitive record using a cryptographically strong, uniform key $R$ extracted from biometric $w$ via a fuzzy extractor; both $P$ and the encrypted record need not be kept secret, because no one can decrypt the record without a $w'$ that is close.

As a step in constructing fuzzy extractors, and as an interesting object in its own right, we propose another primitive, termed secure sketch. It allows precise reconstruction of a noisy input, as follows: on input $w$, a procedure outputs a sketch $s$. Then, given $s$ and a value $w'$ close to $w$, it is possible to recover $w$. The sketch is secure in the sense that it does not reveal much about $w$: $w$ retains much of its entropy even if $s$ is known. Thus, instead of storing $w$ for fear that later readings will be noisy, it is possible to store $s$ instead, without compromising the privacy of $w$. A secure sketch, unlike a fuzzy extractor, allows for the precise reproduction of the original input, but does not address nonuniformity.

Secure sketches, fuzzy extractors and a sample encryption application are illustrated in Figure 1.

Secure sketches and extractors can be viewed as providing fuzzy key storage: they allow recovery of the secret key ($w$ or $R$) from a faulty reading $w'$ of the password $w$ by using some public information ($s$ or $P$). In particular, fuzzy extractors can be viewed as error- and nonuniformity-tolerant secret-key key-encapsulation mechanisms [Sho01].

Because different biometric information has different error patterns, we do not assume any particular notion of closeness between $w'$ and $w$. Rather, in defining our primitives, we simply assume that $w$ comes from some metric space, and that $w'$ is no more than a certain distance from $w$ in that space. We consider particular metrics only when building concrete constructions.

General Results.  Before proceeding to construct our primitives for concrete metrics, we make some observations about our definitions. We demonstrate that fuzzy extractors can be built out of secure sketches by utilizing strong randomness extractors [NZ96], such as, for example, universal hash functions [CW79, WC81] (randomness extractors, defined more precisely below, are families of hash functions which “convert” a high-entropy input into a shorter, uniformly distributed output). We also provide a general technique for constructing secure sketches from transitive families of isometries, which is instantiated in concrete constructions later in the paper. Finally, we define a notion of a biometric embedding of one metric space into another and show that the existence of a fuzzy extractor in the target space, combined with a biometric embedding of the source into the target, implies the existence of a fuzzy extractor in the source space.
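To make the notion of a universal hash family concrete, here is a minimal sketch of one standard example: multiplication by a uniformly random binary matrix over GF(2), for which any two distinct inputs collide with probability $2^{-n_{\mathrm{out}}}$ over the choice of matrix. The function names and parameters (`sample_seed`, `universal_hash`, `n_in`, `n_out`) are illustrative assumptions, not notation from this paper, and the sizes are toy values.

```python
import secrets


def sample_seed(n_in: int, n_out: int) -> list[int]:
    """Draw a random n_out x n_in binary matrix; each row is stored as an n_in-bit integer."""
    return [secrets.randbits(n_in) for _ in range(n_out)]


def universal_hash(seed: list[int], x: int) -> list[int]:
    """Compute the matrix-vector product seed * x over GF(2)."""
    # Each output bit is the parity of the bitwise AND of one row with x.
    return [bin(row & x).count("1") % 2 for row in seed]


seed = sample_seed(n_in=32, n_out=8)   # the seed is public randomness
x = 0b10110010111000010101110000111001
digest = universal_hash(seed, x)       # 8 output bits derived from the 32-bit input
```

In the approach outlined above, such a hash would be applied to the value recovered via a secure sketch, with the seed published alongside the sketch; the output length that can be extracted safely depends on the entropy of the input, which this toy example does not model.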

These general results help us in building and analyzing our constructions.

Our Constructions.  We provide constructions of secure sketches and fuzzy extractors in three metrics: Hamming distance, set difference, and edit distance. Unless stated otherwise, all the constructions are new.

Hamming distance (i.e., the number of symbol positions that differ between $w$ and $w'$) is perhaps the most natural metric to consider. We observe that the “fuzzy-commitment” construction of Juels and Wattenberg [JW99] based on error-correcting codes can be viewed as a (nearly optimal) secure sketch. We then apply our general result to convert it into a nearly optimal fuzzy extractor. While our results on the Hamming distance essentially use previously known constructions, they serve as an important stepping stone for the rest of the work.
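The code-offset idea behind such a sketch can be illustrated as follows: the sketch is $s = w \oplus c$ for a random codeword $c$, and recovery decodes $w' \oplus s$ back to $c$ and returns $c \oplus s$. The toy below assumes a 5-fold binary repetition code, which is far too lossy to be secure and is chosen only for readability; all function names are hypothetical.

```python
import secrets

REP = 5  # each message bit is repeated REP times; corrects up to 2 flips per block


def encode(bits):
    """Repetition-code encoding: repeat every bit REP times."""
    return [b for b in bits for _ in range(REP)]


def decode(word):
    """Majority-decode each block to the nearest codeword."""
    out = []
    for i in range(0, len(word), REP):
        bit = 1 if sum(word[i:i + REP]) > REP // 2 else 0
        out.extend([bit] * REP)
    return out


def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def sketch(w):
    """Return s = w XOR c for a freshly chosen random codeword c."""
    if len(w) % REP != 0:
        raise ValueError("input length must be a multiple of REP")
    message = [secrets.randbits(1) for _ in range(len(w) // REP)]
    return xor(w, encode(message))


def recover(w_prime, s):
    """Recover w from a noisy reading w_prime and the public sketch s."""
    c = decode(xor(w_prime, s))  # w_prime XOR s equals c XOR (error pattern)
    return xor(c, s)


w = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
s = sketch(w)
w_prime = list(w)
w_prime[0] ^= 1  # one error in the first block
w_prime[7] ^= 1  # one error in the second block
recovered = recover(w_prime, s)
```

Recovery succeeds here because each block of the error pattern has at most two flipped bits, which is within the correction capability of the repetition code.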

The set difference metric (i.e., size of the symmetric difference of two input sets $w$ and $w'$) is appropriate whenever the noisy input is represented as a subset of features from a universe of possible features. [Footnote 3: A perhaps unexpected application of the set difference metric was explored in [JS06]: a user would like to encrypt a file (e.g., her phone number) using a small subset of values from a large universe (e.g., her favorite movies) in such a way that those and only those with a similar subset (e.g., similar taste in movies) can decrypt it.] We demonstrate the existence of optimal (with respect to entropy loss) secure sketches and fuzzy extractors for this metric. However, this result is mainly of theoretical interest, because (1) it relies on optimal constant-weight codes, which we do not know how to construct, and (2) it produces sketches of length proportional to the universe size. We then turn our attention to more efficient constructions for this metric in order to handle exponentially large universes. We provide two such constructions.

First, we observe that the “fuzzy vault” construction of Juels and Sudan [JS06] can be viewed as a secure sketch in this metric (and then converted to a fuzzy extractor using our general result). We provide a new, simpler analysis for this construction, which bounds the entropy lost from $w$ given $s$. This bound is quite high unless one makes the size of the output $s$ very large. We then improve the Juels-Sudan construction to reduce the entropy loss and the length of $s$ to near optimal. Our improvement in the running time and in the length of $s$ is exponential for large universe sizes. However, this improved Juels-Sudan construction retains a drawback of the original: it is able to handle only sets of the same fixed size (in particular, $|w'|$ must equal $|w|$).

Second, we provide an entirely different construction, called PinSketch, that maintains the exponential improvements in sketch size and running time and also handles variable set size. To obtain it, we note that in the case of a small universe, a set can be simply encoded as its characteristic vector (1 if an element is in the set, 0 if it is not), and set difference becomes Hamming distance. Even though the length of such a vector becomes unmanageable as the universe size grows, we demonstrate that this approach can be made to work quite efficiently even for exponentially large universes (in particular, because it is not necessary to ever actually write down the vector). This involves a result that may be of independent interest: we show that BCH codes can be decoded in time polynomial in the weight of the received corrupted word (i.e., in sublinear time if the weight is small).
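The correspondence between set difference and Hamming distance of characteristic vectors can be checked on a small example. The universe size and the sets below are arbitrary illustrative choices, and the explicit vectors are written out only because the universe is tiny.

```python
def characteristic_vector(s, universe_size):
    """Return the 0/1 vector with a 1 at index i exactly when i is in the set s."""
    return [1 if i in s else 0 for i in range(universe_size)]


universe_size = 8
w = {1, 4, 6}
w_prime = {1, 5, 6, 7}

x = characteristic_vector(w, universe_size)
y = characteristic_vector(w_prime, universe_size)

hamming = sum(a != b for a, b in zip(x, y))  # Hamming distance of the vectors
set_diff = len(w ^ w_prime)                  # size of the symmetric difference
```

Both quantities equal 3 here, since the sets differ exactly in the elements 4, 5, and 7. For a large universe one would keep only the sets themselves, which is the sparse representation the paragraph above alludes to.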

Finally, edit distance (i.e., the number of insertions and deletions needed to convert one string into the other) comes up, for example, when the password is entered as a string, due to typing errors or mistakes made in handwriting recognition. We discuss two approaches for secure sketches and fuzzy extractors for this metric. First, we observe that a recent low-distortion embedding of Ostrovsky and Rabani [OR05] immediately gives a construction for edit distance. The construction performs well when the number of errors to be corrected is very small (say $n^{\alpha}$ for $\alpha < 1$) but cannot tolerate a large number of errors. Second, we give a biometric embedding (which is less demanding than a low-distortion embedding, but suffices for obtaining fuzzy extractors) from the edit distance metric into the set difference metric. Composing it with a fuzzy extractor for set difference gives a different construction for edit distance, which does better when the number of errors $t$ is large; it can handle as many as $O(n/\log^2 n)$ errors with meaningful entropy loss.

Most of the above constructions are quite practical; some implementations are available [HJR06].

Extending Results for Probabilistic Notions of Correctness.  The definitions and constructions just described use a very strong error model: we require that secure sketches and fuzzy extractors accept every secret $w'$ which is sufficiently close to the original secret $w$, with probability 1. Such a stringent model is useful, as it makes no assumptions on the stochastic and computational properties of the error process. However, slightly relaxing the error conditions allows constructions which tolerate a (provably) much larger number of errors, at the price of restricting the settings in which the constructions can be applied. In Section 8, we extend the definitions and constructions of earlier sections to several relaxed error models.

It is well-known that in the standard setting of error-correction for a binary communication channel, one can tolerate many more errors when the errors are random and independent than when the errors are determined adversarially. In contrast, we present fuzzy extractors that meet Shannon’s bounds for correcting random errors and, moreover, can correct the same number of errors even when errors are adversarial. In our setting, therefore, under a proper relaxation of the correctness condition, adversarial errors are no stronger than random ones. The constructions are quite simple and draw on existing techniques from the coding literature [BBR88, DGL04, Gur03, Lan04, MPSW05].

Relation to Previous Work.  Since our work combines elements of error correction, randomness extraction and password authentication, there has been a lot of related work.

The need to deal with nonuniform and low-entropy passwords has long been realized in the security community, and many approaches have been proposed. For example, Kelsey et al. [KSHW97] suggested using $f(w, r)$ in place of $w$ for the password authentication scenario, where $r$ is a public random “salt,” to make a brute-force attacker’s life harder. While practically useful, this approach does not add any entropy to the password and does not formally address the needed properties of $f$. Another approach, more closely related to ours, is to add biometric features to the password. For example, Ellison et al. [EHMS00] proposed asking the user a series of $n$ personalized questions and using these answers to encrypt the “actual” truly random secret $R$. A similar approach using the user’s keyboard dynamics (and, subsequently, voice [MRLW01a, MRLW01b]) was proposed by Monrose et al. [MRW99]. These approaches require the design of a secure “fuzzy encryption.” The above works proposed heuristic designs (using various forms of Shamir’s secret sharing), but gave no formal analysis. Additionally, error tolerance was addressed only by brute force search.

A formal approach to error tolerance in biometrics was taken by Juels and Wattenberg [JW99] (for less formal solutions, see [DFMP99, MRW99, EHMS00]), who provided a simple way to tolerate errors in uniformly distributed passwords. Frykholm and Juels [FJ01] extended this solution and provided entropy analysis to which ours is similar. Similar approaches have been explored earlier in seemingly unrelated literature on cryptographic information reconciliation, often in the context of quantum cryptography (where Alice and Bob wish to derive a secret key from secrets that have small Hamming distance), particularly [BBR88, BBCS91]. Our construction for the Hamming distance is essentially the same as a component of the quantum oblivious transfer protocol of [BBCS91].

Juels and Sudan [JS06] provided the first construction for a metric other than Hamming: they constructed a “fuzzy vault” scheme for the set difference metric. The main difference is that [JS06] lacks a cryptographically strong definition of the object constructed. In particular, their construction leaks a significant amount of information about their analog of $R$, even though it leaves the adversary with provably “many valid choices” for $R$. In retrospect, their informal notion is closely related to our secure sketches. Our constructions in Section 6 improve exponentially over the construction of [JS06] for storage and computation costs, in the setting when the set elements come from a large universe.

Linnartz and Tuyls [LT03] defined and constructed a primitive very similar to a fuzzy extractor (that line of work was continued in [VTDL03]). The definition of [LT03] focuses on the continuous space $\mathbb{R}^n$ and assumes a particular input distribution (typically a known, multivariate Gaussian). Thus, our definition of a fuzzy extractor can be viewed as a generalization of the notion of a “shielding function” from [LT03]. However, our constructions focus on discrete metric spaces.

Other approaches have also been taken for guaranteeing the privacy of noisy data. Csirmaz and Katona [CK03] considered quantization for correcting errors in “physical random functions.” (This corresponds roughly to secure sketches with no public storage.) Barral, Coron and Naccache [BCN04] proposed a system for offline, private comparison of fingerprints. Although seemingly similar, the problem they study is complementary to ours, and the two solutions can be combined to yield systems which enjoy the benefits of both.

Work on privacy amplification, e.g., [BBR88, BBCM95], as well as work on derandomization and hardness amplification, e.g., [HILL99, NZ96], also addressed the need to extract uniform randomness from a random variable about which some information has been leaked. A major focus of follow-up research has been the development of (ordinary, not fuzzy) extractors with short seeds (see [Sha02] for a survey). We use extractors in this work (though for our purposes, universal hashing is sufficient). Conversely, our work has been applied recently to privacy amplification: Ding [Din05] used fuzzy extractors for noise tolerance in Maurer’s bounded storage model [Mau93].

Independently of our work, similar techniques appeared in the literature on noncryptographic information reconciliation [MTZ03, CT04] (where the goal is communication efficiency rather than secrecy). The relationship between secure sketches and efficient information reconciliation is explored further in Section 9, which discusses, in particular, how our secure sketches for set differences provide more efficient solutions to the set and string reconciliation problems.

Follow-up Work.  Since the original presentation of this paper [DRS04], several follow-up works have appeared (e.g., [Boy04, BDK+05, DS05, DORS06, Smi07, CL06, LSM06, CFL06]). We refer the reader to a recent survey about fuzzy extractors [DRS07] for more information.

2 Preliminaries

Unless explicitly stated otherwise, all logarithms below are base $2$. The Hamming weight (or just weight) of a string is the number of nonzero characters in it. We use $U_\ell$ to denote the uniform distribution on $\ell$-bit binary strings. If an algorithm (or a function) $f$ is randomized, we use the semicolon when we wish to make the randomness explicit: i.e., we denote by $f(x; r)$ the result of computing $f$ on input $x$ with randomness $r$. If $X$ is a probability distribution, then $f(X)$ is the distribution induced on the image of $f$ by applying the (possibly probabilistic) function $f$. If $X$ is a random variable, we will (slightly) abuse notation and also denote by $X$ the probability distribution on the range of the variable.

2.1 Metric Spaces

A metric space is a set $\mathcal{M}$ with a distance function $\mathsf{dis}: \mathcal{M} \times \mathcal{M} \to \mathbb{R}^+ = [0, \infty)$. For the purposes of this work, $\mathcal{M}$ will always be a finite set, and the distance function will take on only integer values (with $\mathsf{dis}(x, y) = 0$ if and only if $x = y$) and will obey symmetry $\mathsf{dis}(x, y) = \mathsf{dis}(y, x)$ and the triangle inequality $\mathsf{dis}(x, z) \leq \mathsf{dis}(x, y) + \mathsf{dis}(y, z)$ (we adopt these requirements for simplicity of exposition, even though the definitions and most of the results below can be generalized to remove these restrictions).

We will concentrate on the following metrics.

  1. Hamming metric. Here $\mathcal{M} = \mathcal{F}^n$ for some alphabet $\mathcal{F}$, and $\mathsf{dis}(w, w')$ is the number of positions in which the strings $w$ and $w'$ differ.

  2. Set difference metric. Here $\mathcal{M}$ consists of all subsets of a universe $\mathcal{U}$. For two sets $w, w'$, their symmetric difference is $w \triangle w' \stackrel{\mathrm{def}}{=} \{x \in w \cup w' \mid x \notin w \cap w'\}$. The distance between two sets $w, w'$ is $|w \triangle w'|$. [Footnote 4: In the preliminary version of this work [DRS04], we worked with this metric scaled by $\frac{1}{2}$; that is, the distance was $\frac{1}{2}|w \triangle w'|$. Not scaling makes more sense, particularly when $w$ and $w'$ are of potentially different sizes, since $|w \triangle w'|$ may be odd. It also agrees with the Hamming distance of characteristic vectors; see Section 6.] We will sometimes restrict $\mathcal{M}$ to contain only $s$-element subsets for some $s$.

  3. Edit metric. Here $\mathcal{M} = \mathcal{F}^*$, and the distance between $w$ and $w'$ is defined to be the smallest number of character insertions and deletions needed to transform $w$ into $w'$. [Footnote 5: Again, in [DRS04], we worked with this metric scaled by $\frac{1}{2}$. Likewise, this makes little sense when strings can be of different lengths, and we avoid it here.] (This is different from the Hamming metric because insertions and deletions shift the characters that are to the right of the insertion/deletion point.)

As already mentioned, all three metrics seem natural for biometric data.
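The three distance functions can be written down directly. The following is a minimal sketch with hypothetical function names; the edit distance uses the identity that, when only insertions and deletions are allowed, the distance equals $|w| + |w'| - 2 \cdot \mathrm{LCS}(w, w')$, where LCS is the length of a longest common subsequence.

```python
def hamming_distance(w, w_prime):
    """Number of positions in which two equal-length sequences differ."""
    if len(w) != len(w_prime):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(a != b for a, b in zip(w, w_prime))


def set_difference_distance(w, w_prime):
    """Size of the symmetric difference of two sets."""
    return len(set(w) ^ set(w_prime))


def edit_distance(w, w_prime):
    """Fewest single-character insertions and deletions turning w into w_prime."""
    # prev[j] holds the LCS length of the processed prefix of w and w_prime[:j].
    prev = [0] * (len(w_prime) + 1)
    for a in w:
        cur = [0]
        for j, b in enumerate(w_prime, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return len(w) + len(w_prime) - 2 * prev[-1]


d_ham = hamming_distance("abcd", "bcde")                   # every position differs
d_edit = edit_distance("abcd", "bcde")                     # delete 'a', insert 'e'
d_set = set_difference_distance({1, 2, 3}, {2, 3, 4, 5})   # elements 1, 4, 5
```

The pair "abcd" and "bcde" shows why the edit metric differs from the Hamming metric: a single shift makes all four positions disagree, yet only one deletion and one insertion are needed.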

2.2 Codes and Syndromes

Since we want to achieve error tolerance in various metric spaces, we will use error-correcting codes for a particular metric. A code C𝐶Citalic_C is a subset {w0,,wK1}subscript𝑤0subscript𝑤𝐾1\left\{{w_{0},\ldots,w_{K-1}}\right\}{ italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT } of K𝐾Kitalic_K elements of {\cal M}caligraphic_M. The map from i𝑖iitalic_i to wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which we will also sometimes denote by C𝐶Citalic_C, is called encoding. The minimum distance of C𝐶Citalic_C is the smallest d>0𝑑0d>0italic_d > 0 such that for all ij𝑖𝑗i\neq jitalic_i ≠ italic_j we have 𝖽𝗂𝗌(wi,wj)d𝖽𝗂𝗌subscript𝑤𝑖subscript𝑤𝑗𝑑{\mathsf{dis}(w_{i},w_{j})}\geq dsansserif_dis ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ≥ italic_d. In our case of integer metrics, this means that one can detect up to (d1)𝑑1(d-1)( italic_d - 1 ) “errors” in an element of {\cal M}caligraphic_M. The error-correcting distance of C𝐶Citalic_C is the largest number t>0𝑡0t>0italic_t > 0 such that for every w𝑤w\in{\cal M}italic_w ∈ caligraphic_M there exists at most one codeword c𝑐citalic_c in the ball of radius t𝑡titalic_t around w𝑤witalic_w: 𝖽𝗂𝗌(w,c)t𝖽𝗂𝗌𝑤𝑐𝑡{\mathsf{dis}(w,c)}\leq tsansserif_dis ( italic_w , italic_c ) ≤ italic_t for at most one cC𝑐𝐶c\in Citalic_c ∈ italic_C. This means that one can correct up to t𝑡titalic_t errors in an element w𝑤witalic_w of {\cal M}caligraphic_M; we will use the term decoding for the map that finds, given w𝑤witalic_w, the cC𝑐𝐶c\in Citalic_c ∈ italic_C such that 𝖽𝗂𝗌(w,c)t𝖽𝗂𝗌𝑤𝑐𝑡{\mathsf{dis}(w,c)}\leq tsansserif_dis ( italic_w , italic_c ) ≤ italic_t (note that for some w𝑤witalic_w, such c𝑐citalic_c may not exist, but if it exists, it will be unique; note also that decoding is not the inverse of encoding in our terminology). 
For integer metrics, the triangle inequality guarantees that $t \geq \lfloor (d-1)/2 \rfloor$. Since error correction will be more important than error detection in our applications, we denote the corresponding codes as $(\mathcal{M}, K, t)$-codes. For efficiency purposes, we will often want encoding and decoding to be polynomial-time.

For the Hamming metric over $\mathcal{F}^n$, we will sometimes call $k = \log_{|\mathcal{F}|} K$ the dimension of the code and denote the code itself as an $[n, k, d = 2t+1]_{\mathcal{F}}$-code, following the standard notation in the literature. We will denote by $A_{|\mathcal{F}|}(n, d)$ the maximum $K$ possible in such a code (omitting the subscript when $|\mathcal{F}| = 2$), and by $A(n, d, s)$ the maximum $K$ for such a code over $\{0,1\}^n$ with the additional restriction that all codewords have exactly $s$ ones.

If the code is linear (i.e., $\mathcal{F}$ is a field, $\mathcal{F}^n$ is a vector space over $\mathcal{F}$, and $C$ is a linear subspace), then one can fix a parity-check matrix $H$ as any matrix whose rows generate the orthogonal space $C^{\perp}$. Then for any $v \in \mathcal{F}^n$, the syndrome is defined as $\mathsf{syn}(v) \stackrel{\rm def}{=} Hv$. The syndrome of a vector is its projection onto the subspace that is orthogonal to the code and can thus be intuitively viewed as the vector modulo the code. Note that $v \in C \Leftrightarrow \mathsf{syn}(v) = 0$. Note also that $H$ is an $(n-k) \times n$ matrix and that $\mathsf{syn}(v)$ is $n-k$ bits long.

The syndrome captures all the information necessary for decoding. That is, suppose a codeword $c$ is sent through a channel and the word $w = c + e$ is received. First, the syndrome of $w$ is the syndrome of $e$: $\mathsf{syn}(w) = \mathsf{syn}(c) + \mathsf{syn}(e) = 0 + \mathsf{syn}(e) = \mathsf{syn}(e)$. Moreover, for any value $u$, there is at most one word $e$ of weight less than $d/2$ such that $\mathsf{syn}(e) = u$ (because the existence of a pair of distinct words $e_1, e_2$ would mean that $e_1 - e_2$ is a codeword of weight less than $d$, but since $0^n$ is also a codeword and the minimum distance of the code is $d$, this is impossible). Thus, knowing the syndrome $\mathsf{syn}(w)$ is enough to determine the error pattern $e$ if not too many errors occurred.
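The two facts above can be checked concretely on a small code. The sketch below is an illustration only: it assumes the binary $[7,4,3]$ Hamming code (not a code singled out by the text), and the helper names `syn`, `add`, `error_for`, and `decode` are hypothetical. It builds the table from syndromes to error patterns of weight less than $d/2$ and uses it to decode.

```python
import itertools

# Parity-check matrix of the binary [7,4,3] Hamming code (an assumed example):
# column j (1-indexed) is the binary expansion of j.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]
n = 7

def syn(v):
    """Syndrome Hv over GF(2), returned as a tuple of n - k = 3 bits."""
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)

def add(u, v):
    """Coordinate-wise addition over GF(2) (the same as subtraction)."""
    return tuple((a + b) % 2 for a, b in zip(u, v))

# Each syndrome maps to the unique error pattern of weight < d/2, i.e. weight <= 1.
# j = -1 yields the all-zero pattern; j = 0..6 yield the seven unit vectors.
error_for = {syn(e): e for e in
             (tuple(int(i == j) for i in range(n)) for j in range(-1, n))}

def decode(w):
    """Correct up to one error: look up e from syn(w) and return c = w - e."""
    return add(w, error_for[syn(w)])

# The code itself is the set of words with zero syndrome.
codewords = [v for v in itertools.product((0, 1), repeat=n) if syn(v) == (0, 0, 0)]
```

Here the table has exactly $2^{n-k} = 8$ entries, one per syndrome, which is why a lookup on $\mathsf{syn}(w)$ alone suffices; for larger codes one would use an algebraic decoder rather than a table.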

2.3 Min-Entropy, Statistical Distance, Universal Hashing, and Strong Extractors

When discussing security, one is often interested in the probability that the adversary predicts a random value (e.g., guesses a secret key). The adversary’s best strategy, of course, is to guess the most likely value. Thus, the predictability of a random variable $A$ is $\max_a \Pr[A = a]$, and, correspondingly, its min-entropy $\mathbf{H}_\infty(A)$ is $-\log(\max_a \Pr[A = a])$ (min-entropy can thus be viewed as the “worst-case” entropy [CG88]; see also Section 2.4).

The min-entropy of a distribution tells us how many nearly uniform random bits can be extracted from it. The notion of “nearly” is defined as follows. The statistical distance between two probability distributions $A$ and $B$ is $\mathbf{SD}(A, B) = \frac{1}{2}\sum_v |\Pr(A = v) - \Pr(B = v)|$.
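Both quantities are easy to compute for a distribution given explicitly. The following is a minimal sketch, assuming distributions are represented as dictionaries from values to probabilities and logarithms are base 2; the function names and the example distribution `biased` are choices made for this illustration.

```python
import math

def min_entropy(dist):
    """H_inf(A) = -log2(max_a Pr[A = a]) for a dict mapping values to probabilities."""
    return -math.log2(max(dist.values()))

def statistical_distance(p, q):
    """SD(A, B) = 1/2 * sum_v |Pr[A = v] - Pr[B = v]| over the union of supports."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# A skewed distribution on 2-bit strings, compared with the uniform one.
biased = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}
uniform = {v: 0.25 for v in biased}

h = min_entropy(biased)                     # determined by the most likely value "00"
sd = statistical_distance(biased, uniform)  # how far "biased" is from uniform
```

In this example the most likely value has probability $1/2$, so the min-entropy is 1 bit even though the distribution is spread over four values.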

Recall the definition of strong randomness extractors [NZ96].

Definition 1.

Let $\mathsf{Ext} : \{0,1\}^n \to \{0,1\}^\ell$ be a polynomial time probabilistic function which uses $r$ bits of randomness. We say that $\mathsf{Ext}$ is an efficient $(n, m, \ell, \epsilon)$-strong extractor if for all min-entropy $m$ distributions $W$ on $\{0,1\}^n$, $\mathbf{SD}\left((\mathsf{Ext}(W;X), X), (U_\ell, X)\right) \leq \epsilon$, where $X$ is uniform on $\{0,1\}^r$.

Strong extractors can extract at most $\ell = m - 2\log\left(\frac{1}{\epsilon}\right) + O(1)$ nearly random bits [RTS00]. Many constructions match this bound (see Shaltiel’s survey [Sha02] for references). Extractor constructions are often complex since they seek to minimize the length of the seed $X$. For our purposes, the length of $X$ will be less important, so universal hash functions [CW79, WC81] (defined in the lemma below) will already give us the optimal $\ell = m - 2\log\left(\frac{1}{\epsilon}\right) + 2$, as given by the leftover hash lemma below (see [HILL99, Lemma 4.8] as well as references therein for earlier versions):

Lemma 2.1 (Universal Hash Functions and the Leftover-Hash / Privacy-Amplification Lemma).

Assume a family of functions $\{H_x : \{0,1\}^n \to \{0,1\}^\ell\}_{x \in X}$ is universal: for all $a \neq b \in \{0,1\}^n$, $\Pr_{x \in X}[H_x(a) = H_x(b)] = 2^{-\ell}$. Then, for any random variable $W$ (footnote 6: in [HILL99], this inequality is formulated in terms of Rényi entropy of order two of $W$; the change to $\mathbf{H}_\infty(W)$ is allowed because the latter is no greater than the former),

$$\mathbf{SD}\left((H_X(W), X)\,,\,(U_\ell, X)\right) \leq \frac{1}{2}\sqrt{2^{-\mathbf{H}_\infty(W)}\, 2^{\ell}}\,. \tag{1}$$

In particular, universal hash functions are $(n, m, \ell, \epsilon)$-strong extractors whenever $\ell \leq m - 2\log\left(\frac{1}{\epsilon}\right) + 2$.
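For very small parameters, inequality (1) can be checked by exhaustive enumeration. The sketch below makes several assumptions for illustration: the universal family is taken to be multiplication by a uniformly random $\ell \times n$ binary matrix over GF(2) (one standard universal family, not necessarily the one intended by the text), the sizes $n = 4$ and $\ell = 2$ are toy values, and the names `hash_with` and `lhl_gap` are hypothetical.

```python
import itertools
import math

n, ell = 4, 2  # toy sizes, chosen only so exhaustive enumeration is fast

def hash_with(matrix, w):
    """H_x(w) = x * w over GF(2), where the seed x is an ell-by-n binary matrix."""
    return tuple(sum(a * b for a, b in zip(row, w)) % 2 for row in matrix)

def lhl_gap(dist):
    """Return (exact SD((H_X(W), X), (U_ell, X)), right-hand side of inequality (1))."""
    rows = list(itertools.product((0, 1), repeat=n))
    seeds = list(itertools.product(rows, repeat=ell))  # all 2^(ell*n) matrices
    total = 0.0
    for x in seeds:
        out = {}
        for w, p in dist.items():
            y = hash_with(x, w)
            out[y] = out.get(y, 0.0) + p
        # SD between H_x(W) and the uniform distribution, for this fixed seed.
        total += 0.5 * sum(abs(out.get(y, 0.0) - 2.0 ** -ell)
                           for y in itertools.product((0, 1), repeat=ell))
    sd = total / len(seeds)  # X is uniform and public, so the SD averages over seeds
    h_inf = -math.log2(max(dist.values()))
    bound = 0.5 * math.sqrt(2.0 ** (-h_inf) * 2.0 ** ell)
    return sd, bound

# W uniform on 8 of the 16 strings, so its min-entropy is 3 bits.
W = {w: 1 / 8 for w in itertools.product((0, 1), repeat=n) if w[0] == w[1]}
sd, bound = lhl_gap(W)
```

The family is universal because, for $a \neq b$, a random matrix maps the nonzero vector $a - b$ to a uniform element of $\{0,1\}^\ell$, so a collision happens with probability exactly $2^{-\ell}$.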

2.4 Average Min-Entropy

Recall that the predictability of a random variable $A$ is $\max_a \Pr[A = a]$, and its min-entropy $\mathbf{H}_\infty(A)$ is $-\log(\max_a \Pr[A = a])$. Consider now a pair of (possibly correlated) random variables $A, B$. If the adversary finds out the value $b$ of $B$, then the predictability of $A$ becomes $\max_a \Pr[A = a \mid B = b]$. On average, the adversary’s chance of success in predicting $A$ is then $\mathbb{E}_{b \leftarrow B}\left[\max_a \Pr[A = a \mid B = b]\right]$. Note that we are taking the average over $B$ (which is not under adversarial control), but the worst case over $A$ (because prediction of $A$ is adversarial once $b$ is known). Again, it is convenient to talk about security in log-scale, which is why we define the average min-entropy of $A$ given $B$ as simply the negative logarithm of the above:

$$\tilde{\mathbf{H}}_\infty(A \mid B) \stackrel{\rm def}{=} -\log\left(\mathbb{E}_{b \leftarrow B}\left[\max_a \Pr[A = a \mid B = b]\right]\right) = -\log\left(\mathbb{E}_{b \leftarrow B}\left[2^{-\mathbf{H}_\infty(A \mid B = b)}\right]\right).$$

Because other notions of entropy have been studied in cryptographic literature, a few words are in order to explain why this definition is useful. Note the importance of taking the logarithm after taking the average (in contrast, for instance, to conditional Shannon entropy). One may think it more natural to define average min-entropy as $\mathbb{E}_{b \leftarrow B}\left[\mathbf{H}_\infty(A \mid B = b)\right]$, thus reversing the order of $\log$ and $\mathbb{E}$. However, this notion is unlikely to be useful in a security application. For a simple example, consider the case when $A$ and $B$ are 1000-bit strings distributed as follows: $B = U_{1000}$ and $A$ is equal to the value $b$ of $B$ if the first bit of $b$ is 0, and $U_{1000}$ (independent of $B$) otherwise. Then for half of the values of $b$, $\mathbf{H}_\infty(A \mid B = b) = 0$, while for the other half, $\mathbf{H}_\infty(A \mid B = b) = 1000$, so $\mathbb{E}_{b \leftarrow B}\left[\mathbf{H}_\infty(A \mid B = b)\right] = 500$.
However, it would be obviously incorrect to say that $A$ has 500 bits of security. In fact, an adversary who knows the value $b$ of $B$ has a slightly greater than $50\%$ chance of predicting the value of $A$ by outputting $b$. Our definition correctly captures this $50\%$ chance of prediction, because $\tilde{\mathbf{H}}_\infty(A \mid B)$ is slightly less than 1. In fact, our definition of average min-entropy is simply the negative logarithm of the average predictability.
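The gap between the two orders of $\log$ and $\mathbb{E}$ can be reproduced numerically on a scaled-down version of this example. The sketch below assumes 8-bit strings in place of the 1000-bit strings (so that the joint distribution can be enumerated), base-2 logarithms, and hypothetical function names; the joint distribution is a dictionary keyed by pairs $(a, b)$.

```python
import itertools
import math

def conditional_predictabilities(joint):
    """Map each b to (Pr[B = b], max_a Pr[A = a | B = b]) from a joint dict {(a, b): p}."""
    pr_b, best = {}, {}
    for (a, b), p in joint.items():
        pr_b[b] = pr_b.get(b, 0.0) + p
        best[b] = max(best.get(b, 0.0), p)
    return {b: (pr_b[b], best[b] / pr_b[b]) for b in pr_b}

def average_min_entropy(joint):
    """-log2 of E_b[max_a Pr[A = a | B = b]]: the log is taken after averaging."""
    cp = conditional_predictabilities(joint)
    return -math.log2(sum(pb * pred for pb, pred in cp.values()))

def expected_min_entropy(joint):
    """E_b[H_inf(A | B = b)]: the log is taken before averaging (the misleading order)."""
    cp = conditional_predictabilities(joint)
    return sum(pb * -math.log2(pred) for pb, pred in cp.values())

n = 8  # stand-in for the 1000-bit strings of the text
strings = list(itertools.product((0, 1), repeat=n))
joint = {}
for b in strings:
    if b[0] == 0:
        joint[(b, b)] = 2.0 ** -n                  # A copies b
    else:
        for a in strings:
            joint[(a, b)] = 2.0 ** -n * 2.0 ** -n  # A is uniform and independent of b

avg = average_min_entropy(joint)   # slightly less than 1 bit
exp = expected_min_entropy(joint)  # n / 2 bits
```

With these parameters the expected min-entropy comes out to $n/2$ bits, while the average min-entropy equals $-\log\left(\frac{1}{2} + 2^{-(n+1)}\right)$, just under 1 bit, matching the adversary’s roughly even chance of guessing $A$.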

The following useful properties of average min-entropy are proven in Appendix A. We also refer the reader to Appendix B for a generalization of average min-entropy and a discussion of the relationship between this notion and other notions of entropy.

Lemma 2.2.

Let $A, B, C$ be random variables. Then

  • (a) For any $\delta > 0$, the conditional entropy $\mathbf{H}_\infty(A \mid B = b)$ is at least $\tilde{\mathbf{H}}_\infty(A \mid B) - \log(1/\delta)$ with probability at least $1 - \delta$ over the choice of $b$.

  • (b) If $B$ has at most $2^\lambda$ possible values, then $\tilde{\mathbf{H}}_\infty(A \mid (B, C)) \geq \tilde{\mathbf{H}}_\infty((A, B) \mid C) - \lambda \geq \tilde{\mathbf{H}}_\infty(A \mid C) - \lambda$. In particular, $\tilde{\mathbf{H}}_\infty(A \mid B) \geq \mathbf{H}_\infty((A, B)) - \lambda \geq \mathbf{H}_\infty(A) - \lambda$.

2.5 Average-Case Extractors

Recall from Definition 1 that a strong extractor allows one to extract almost all the min-entropy from some nonuniform random variable $W$. In many situations, $W$ represents the adversary’s uncertainty about some secret $w$ conditioned on some side information $i$. Since this side information $i$ is often probabilistic, we shall find the following generalization of a strong extractor useful (see Lemma 4.1).

Definition 2.

Let $\mathsf{Ext} : \{0,1\}^n \to \{0,1\}^\ell$ be a polynomial time probabilistic function which uses $r$ bits of randomness. We say that $\mathsf{Ext}$ is an efficient average-case $(n, m, \ell, \epsilon)$-strong extractor if for all pairs of random variables $(W, I)$ such that $W$ is an $n$-bit string satisfying $\tilde{\mathbf{H}}_\infty(W \mid I) \geq m$, we have $\mathbf{SD}\left((\mathsf{Ext}(W;X), X, I), (U_\ell, X, I)\right) \leq \epsilon$, where $X$ is uniform on $\{0,1\}^r$.

To distinguish the strong extractors of Definition 1 from average-case strong extractors, we will sometimes call the former worst-case strong extractors. The two notions are closely related, as can be seen from the following simple application of Lemma 2.2(a).

Lemma 2.3.

For any $\delta > 0$, if $\mathsf{Ext}$ is a (worst-case) $(n, m - \log\left(\frac{1}{\delta}\right), \ell, \epsilon)$-strong extractor, then $\mathsf{Ext}$ is also an average-case $(n, m, \ell, \epsilon + \delta)$-strong extractor.

Proof.

Assume $(W, I)$ are such that $\tilde{\mathbf{H}}_\infty(W \mid I) \geq m$. Let $W_i = (W \mid I = i)$ and let us call the value $i$ “bad” if $\mathbf{H}_\infty(W_i) < m - \log\left(\frac{1}{\delta}\right)$. Otherwise, we say that $i$ is “good”. By Lemma 2.2(a), $\Pr(i \text{ is bad}) \leq \delta$. Also, for any good $i$, we have that $\mathsf{Ext}$ extracts $\ell$ bits that are $\epsilon$-close to uniform from $W_i$. Thus, by conditioning on the “goodness” of $I$, we get

$$\begin{aligned}
\mathbf{SD}\left((\mathsf{Ext}(W;X), X, I), (U_\ell, X, I)\right)
&= \sum_i \Pr(i) \cdot \mathbf{SD}\left((\mathsf{Ext}(W_i;X), X), (U_\ell, X)\right) \\
&\leq \Pr(i \text{ is bad}) \cdot 1 + \sum_{\text{good } i} \Pr(i) \cdot \mathbf{SD}\left((\mathsf{Ext}(W_i;X), X), (U_\ell, X)\right) \\
&\leq \delta + \epsilon .
\end{aligned}$$

However, for many strong extractors we do not have to suffer this additional dependence on $\delta$, because the strong extractor may be already average-case. In particular, this holds for extractors obtained via universal hashing.

Lemma 2.4 (Generalized Leftover Hash Lemma).

Assume $\{H_x : \{0,1\}^n \to \{0,1\}^\ell\}_{x \in X}$ is a family of universal hash functions. Then, for any random variables $W$ and $I$,

$$\mathbf{SD}\left((H_X(W), X, I)\,,\,(U_\ell, X, I)\right) \leq \frac{1}{2}\sqrt{2^{-\tilde{\mathbf{H}}_\infty(W \mid I)}\, 2^{\ell}}\,. \tag{2}$$

In particular, universal hash functions are average-case $(n, m, \ell, \epsilon)$-strong extractors whenever $\ell \leq m - 2\log\left(\frac{1}{\epsilon}\right) + 2$.
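The step from inequality (2) to this bound on $\ell$ is short; it is spelled out here as a worked derivation, assuming base-2 logarithms and $\tilde{\mathbf{H}}_\infty(W \mid I) \geq m$ as in Definition 2. Under that assumption the right-hand side of (2) is at most $\frac{1}{2}\, 2^{(\ell - m)/2}$, and it suffices to make this quantity at most $\epsilon$:

```latex
\begin{align*}
\frac{1}{2}\, 2^{(\ell - m)/2} \le \epsilon
  &\iff \frac{\ell - m}{2} \le 1 + \log \epsilon \\
  &\iff \ell \le m + 2 - 2\log\left(\frac{1}{\epsilon}\right).
\end{align*}
```

As an illustration with hypothetical numbers, $m = 200$ and $\epsilon = 2^{-64}$ allow $\ell \le 200 - 128 + 2 = 74$ extracted bits.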

Proof.

Let $W_i = (W \mid I = i)$. Then

$$\begin{aligned}
\mathbf{SD}\left((H_X(W), X, I)\,,\,(U_\ell, X, I)\right)
&= \mathbb{E}_i\left[\mathbf{SD}\left((H_X(W_i), X)\,,\,(U_\ell, X)\right)\right] \\
&\leq \frac{1}{2}\,\mathbb{E}_i\left[\sqrt{2^{-\mathbf{H}_\infty(W_i)}\, 2^{\ell}}\right] \\
&\leq \frac{1}{2}\sqrt{\mathbb{E}_i\left[2^{-\mathbf{H}_\infty(W_i)}\, 2^{\ell}\right]} \\
&= \frac{1}{2}\sqrt{2^{-\tilde{\mathbf{H}}_\infty(W \mid I)}\, 2^{\ell}}\,.
\end{aligned}$$

In the above derivation, the first inequality follows from the standard Leftover Hash Lemma (Lemma 2.1), and the second inequality follows from Jensen’s inequality (namely, $\mathbb{E}\left[\sqrt{Z}\right] \leq \sqrt{\mathbb{E}\left[Z\right]}$). ∎

3 New Definitions

3.1 Secure Sketches

Let $\mathcal{M}$ be a metric space with distance function $\mathsf{dis}$.

Definition 3.

An $(\mathcal{M}, m, \tilde{m}, t)$-secure sketch is a pair of randomized procedures, “sketch” ($\mathsf{SS}$) and “recover” ($\mathsf{Rec}$), with the following properties:

  1. The sketching procedure $\mathsf{SS}$ on input $w \in \mathcal{M}$ returns a bit string $s \in \{0,1\}^*$.

  2. The recovery procedure $\mathsf{Rec}$ takes an element $w' \in \mathcal{M}$ and a bit string $s \in \{0,1\}^*$. The correctness property of secure sketches guarantees that if $\mathsf{dis}(w, w') \leq t$, then $\mathsf{Rec}(w', \mathsf{SS}(w)) = w$. If $\mathsf{dis}(w, w') > t$, then no guarantee is provided about the output of $\mathsf{Rec}$.

  3. The security property guarantees that for any distribution $W$ over $\mathcal{M}$ with min-entropy $m$, the value of $W$ can be recovered by the adversary who observes $s$ with probability no greater than $2^{-\tilde{m}}$. That is, $\tilde{\mathbf{H}}_\infty(W \mid \mathsf{SS}(W)) \geq \tilde{m}$.

A secure sketch is efficient if $\mathsf{SS}$ and $\mathsf{Rec}$ run in expected polynomial time.

Average-Case Secure Sketches.  In many situations, it may well be that the adversary’s information $i$ about the password $w$ is probabilistic, so that sometimes $i$ reveals a lot about $w$, but most of the time $w$ stays hard to predict even given $i$. In this case, the previous definition of secure sketch is hard to apply: it provides no guarantee if $\mathbf{H}_\infty(W \mid i)$ is not fixed to at least $m$ for some bad (but infrequent) values of $i$. A more robust definition would provide the same guarantee for all pairs of variables $(W, I)$ such that predicting the value of $W$ given the value of $I$ is hard. We therefore define an average-case secure sketch as follows:

Definition 4.

An average-case $(\mathcal{M}, m, \tilde{m}, t)$-secure sketch is a secure sketch (as defined in Definition 3) whose security property is strengthened as follows: for any random variables $W$ over $\mathcal{M}$ and $I$ over $\{0,1\}^*$ such that $\tilde{\mathbf{H}}_\infty(W \mid I) \geq m$, we have $\tilde{\mathbf{H}}_\infty(W \mid (\mathsf{SS}(W), I)) \geq \tilde{m}$. Note that an average-case secure sketch is also a secure sketch (take $I$ to be empty).

This definition has the advantage that it composes naturally, as shown in Lemma 4.7. All of our constructions will in fact be average-case secure sketches. However, we will often omit the term “average-case” for simplicity of exposition.

Entropy Loss.  The quantity $\tilde{m}$ is called the residual (min-)entropy of the secure sketch, and the quantity $\lambda = m - \tilde{m}$ is called the entropy loss of a secure sketch. In analyzing the security of our secure sketch constructions below, we will typically bound the entropy loss regardless of $m$, thus obtaining families of secure sketches that work for all $m$ (in general, [Rey07] shows that the entropy loss of a secure sketch is upper bounded by its entropy loss on the uniform distribution of inputs). Specifically, for a given construction of $\mathsf{SS}$, $\mathsf{Rec}$ and a given value $t$, we will get a value $\lambda$ for the entropy loss, such that, for any $m$, $(\mathsf{SS}, \mathsf{Rec})$ is an $(\mathcal{M}, m, m - \lambda, t)$-secure sketch. In fact, the most common way to obtain such secure sketches would be to bound the entropy loss by the length of the secure sketch $\mathsf{SS}(w)$, as given in the following simple lemma:

Lemma 3.1.

Assume some algorithms $\mathsf{SS}$ and $\mathsf{Rec}$ satisfy the correctness property of a secure sketch for some value of $t$, and that the output range of $\mathsf{SS}$ has size at most $2^{\lambda}$ (this holds, in particular, if the length of the sketch is bounded by $\lambda$). Then, for any min-entropy threshold $m$, $(\mathsf{SS},\mathsf{Rec})$ form an average-case $(\mathcal{M}, m, m-\lambda, t)$-secure sketch for $\mathcal{M}$. In particular, for any $m$, the entropy loss of this construction is at most $\lambda$.

Proof.

The result follows immediately from Lemma 2.2(b), since $\mathsf{SS}(W)$ has at most $2^{\lambda}$ values: for any $(W,I)$, $\tilde{\mathbf{H}}_\infty(W \mid (\mathsf{SS}(W), I)) \geq \tilde{\mathbf{H}}_\infty(W \mid I) - \lambda$. ∎

The above observation formalizes the intuition that a good secure sketch should be as short as possible. In particular, a short secure sketch will likely result in a better entropy loss. More discussion about this relation can be found in Section 9.
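The bound of Lemma 3.1 can be checked numerically on a toy example. The following Python sketch is an illustration under assumptions that are not part of the text: the space is $\{0,1\}^3$, the sketch is the 2-bit syndrome of the 3-bit repetition code (so $\lambda = 2$), and the two test distributions are arbitrary. It computes the average min-entropy as $-\log_2 \sum_s \max_w \Pr[W = w \wedge \mathsf{SS}(W) = s]$ and compares it with $\mathbf{H}_\infty(W) - \lambda$.

```python
import math
from itertools import product

def min_entropy(dist):
    """H_inf(W) = -log2 of the largest point probability."""
    return -math.log2(max(dist.values()))

def avg_min_entropy(dist, sketch):
    """Average min-entropy of W given sketch(W), for a deterministic sketch."""
    best = {}
    for w, p in dist.items():
        s = sketch(w)
        # For each sketch value keep the most likely preimage.
        best[s] = max(best.get(s, 0.0), p)
    return -math.log2(sum(best.values()))

def syndrome(w):
    """Hypothetical 2-bit sketch: syndrome of the 3-bit repetition code."""
    return (w[0] ^ w[1], w[0] ^ w[2])

LAMBDA = 2  # the sketch has at most 2**LAMBDA possible values
points = list(product((0, 1), repeat=3))
uniform = {w: 1 / 8 for w in points}
skewed = {w: (0.3 if w == (0, 0, 0) else 0.1) for w in points}

for name, dist in (("uniform", uniform), ("skewed", skewed)):
    m = min_entropy(dist)
    residual = avg_min_entropy(dist, syndrome)
    print(name, "m =", m, "residual =", residual, "bound =", m - LAMBDA)
```

On the uniform distribution the bound is met with equality, while on the skewed distribution the residual entropy is strictly larger than the bound; the lemma only guarantees the inequality.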

3.2 Fuzzy Extractors

Definition 5.

An $(\mathcal{M}, m, \ell, t, \epsilon)$-fuzzy extractor is a pair of randomized procedures, “generate” ($\mathsf{Gen}$) and “reproduce” ($\mathsf{Rep}$), with the following properties:

  1. The generation procedure $\mathsf{Gen}$ on input $w \in \mathcal{M}$ outputs an extracted string $R \in \{0,1\}^{\ell}$ and a helper string $P \in \{0,1\}^*$.

  2. The reproduction procedure $\mathsf{Rep}$ takes an element $w' \in \mathcal{M}$ and a bit string $P \in \{0,1\}^*$ as inputs. The correctness property of fuzzy extractors guarantees that if $\mathsf{dis}(w,w') \leq t$ and $R, P$ were generated by $(R,P) \leftarrow \mathsf{Gen}(w)$, then $\mathsf{Rep}(w',P) = R$. If $\mathsf{dis}(w,w') > t$, then no guarantee is provided about the output of $\mathsf{Rep}$.

  3. The security property guarantees that for any distribution $W$ on $\mathcal{M}$ of min-entropy $m$, the string $R$ is nearly uniform even for those who observe $P$: if $(R,P) \leftarrow \mathsf{Gen}(W)$, then $\mathbf{SD}((R,P),(U_\ell,P)) \leq \epsilon$.

A fuzzy extractor is efficient if $\mathsf{Gen}$ and $\mathsf{Rep}$ run in expected polynomial time.

In other words, fuzzy extractors allow one to extract some randomness $R$ from $w$ and then successfully reproduce $R$ from any string $w'$ that is close to $w$. The reproduction uses the helper string $P$ produced during the initial extraction; yet $P$ need not remain secret, because $R$ looks truly random even given $P$. To justify our terminology, notice that strong extractors (as defined in Section 2) can indeed be seen as “nonfuzzy” analogs of fuzzy extractors, corresponding to $t = 0$, $P = X$, and $\mathcal{M} = \{0,1\}^n$.

We reiterate that the nearly uniform random bits output by a fuzzy extractor can be used in any cryptographic context that requires uniform random bits (e.g., for secret keys). The slight nonuniformity of the bits may decrease security, but by no more than their distance $\epsilon$ from uniform. By choosing $\epsilon$ negligibly small (e.g., $2^{-80}$ should be enough in practice), one can make the decrease in security irrelevant.

Similarly to secure sketches, the quantity $m - \ell$ is called the entropy loss of a fuzzy extractor. Also similarly, a more robust definition is that of an average-case fuzzy extractor, which requires that if $\tilde{\mathbf{H}}_\infty(W \mid I) \geq m$, then $\mathbf{SD}((R,P,I),(U_\ell,P,I)) \leq \epsilon$ for any auxiliary random variable $I$.

4 Metric-Independent Results

In this section we demonstrate some general results that do not depend on specific metric spaces. They will be helpful in obtaining specific results for particular metric spaces below. In addition to the results in this section, some generic combinatorial lower bounds on secure sketches and fuzzy extractors are contained in Appendix C. We will later use these bounds to show the near-optimality of some of our constructions for the case of uniform inputs. [Footnote 7: Although we believe our constructions to be near optimal for nonuniform inputs as well, and our combinatorial bounds in Appendix C are also meaningful for such inputs, at this time we can use these bounds effectively only for uniform inputs.]

4.1 Construction of Fuzzy Extractors from Secure Sketches

Not surprisingly, secure sketches are quite useful in constructing fuzzy extractors. Specifically, we construct fuzzy extractors from secure sketches and strong extractors as follows: apply $\mathsf{SS}$ to $w$ to obtain $s$, and a strong extractor $\mathsf{Ext}$ with randomness $x$ to $w$ to obtain $R$. Store $(s,x)$ as the helper string $P$. To reproduce $R$ from $w'$ and $P = (s,x)$, first use $\mathsf{Rec}(w',s)$ to recover $w$ and then $\mathsf{Ext}(w,x)$ to get $R$.


A few details need to be filled in. First, in order to apply $\mathsf{Ext}$ to $w$, we will assume that one can represent elements of $\mathcal{M}$ using $n$ bits. Second, since after leaking the secure sketch value $s$, the password $w$ has only conditional min-entropy, technically we need to use the average-case strong extractor, as defined in Definition 2. The formal statement is given below.

Lemma 4.1 (Fuzzy Extractors from Sketches).

Assume $(\mathsf{SS},\mathsf{Rec})$ is an $(\mathcal{M}, m, \tilde{m}, t)$-secure sketch, and let $\mathsf{Ext}$ be an average-case $(n, \tilde{m}, \ell, \epsilon)$-strong extractor. Then the following $(\mathsf{Gen},\mathsf{Rep})$ is an $(\mathcal{M}, m, \ell, t, \epsilon)$-fuzzy extractor:

  • $\mathsf{Gen}(w; r, x)$: set $P = (\mathsf{SS}(w; r), x)$, $R = \mathsf{Ext}(w; x)$, and output $(R,P)$.

  • $\mathsf{Rep}(w', (s,x))$: recover $w = \mathsf{Rec}(w', s)$ and output $R = \mathsf{Ext}(w; x)$.

Proof.

From the definition of secure sketch (Definition 3), we know that $\tilde{\mathbf{H}}_\infty(W \mid \mathsf{SS}(W)) \geq \tilde{m}$. And since $\mathsf{Ext}$ is an average-case $(n, \tilde{m}, \ell, \epsilon)$-strong extractor, $\mathbf{SD}((\mathsf{Ext}(W;X), \mathsf{SS}(W), X), (U_\ell, \mathsf{SS}(W), X)) = \mathbf{SD}((R,P),(U_\ell,P)) \leq \epsilon$. ∎
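As a purely illustrative sketch of the construction in Lemma 4.1, the following Python code instantiates it over the Hamming space $\{0,1\}^5$. The concrete choices are assumptions made for the example and are not taken from the text: $\mathsf{SS}$ is a code-offset sketch based on the 5-bit repetition code (so $t = 2$), and $\mathsf{Ext}$ is universal hashing by a random binary matrix serving as the seed $x$. The parameters are far too small to offer any security; the point is only the data flow between $\mathsf{Gen}$ and $\mathsf{Rep}$.

```python
import secrets

N = 5    # toy input length in bits (hypothetical parameter)
ELL = 1  # number of extracted bits (hypothetical parameter)

def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

def ss(w):
    """Toy secure sketch: w xor a random codeword of the repetition code."""
    c = (secrets.randbits(1),) * N
    return xor(w, c)

def rec(w_prime, s):
    """Recover w from any w_prime within Hamming distance 2 of w."""
    shifted = xor(w_prime, s)          # equals the codeword plus the error pattern
    bit = int(sum(shifted) > N // 2)   # majority decoding
    return xor(s, (bit,) * N)

def ext(w, x):
    """Universal hash: multiply w by the ELL x N binary matrix x over GF(2)."""
    return tuple(sum(a & b for a, b in zip(row, w)) % 2 for row in x)

def gen(w):
    s = ss(w)
    x = tuple(tuple(secrets.randbits(1) for _ in range(N)) for _ in range(ELL))
    return ext(w, x), (s, x)           # (R, P) with P = (s, x)

def rep(w_prime, p):
    s, x = p
    return ext(rec(w_prime, s), x)

w = (1, 0, 1, 1, 0)
R, P = gen(w)
w_noisy = (0, 0, 1, 1, 1)              # differs from w in two positions
print(rep(w_noisy, P) == R)
```

Note that $\mathsf{Rep}$ never sees $w$ directly: it first recovers $w$ from the public sketch $s$ and the noisy reading, and only then re-applies the extractor with the public seed $x$.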

On the other hand, if one would like to use a worst-case strong extractor, we can apply Lemma 2.3 to get

Corollary 4.2.

If $(\mathsf{SS},\mathsf{Rec})$ is an $(\mathcal{M}, m, \tilde{m}, t)$-secure sketch and $\mathsf{Ext}$ is an $(n, \tilde{m} - \log\left(\frac{1}{\delta}\right), \ell, \epsilon)$-strong extractor, then the above construction $(\mathsf{Gen},\mathsf{Rep})$ is an $(\mathcal{M}, m, \ell, t, \epsilon + \delta)$-fuzzy extractor.

Both Lemma 4.1 and Corollary 4.2 hold (with the same proofs) for building average-case fuzzy extractors from average-case secure sketches.

While the above statements work for general extractors, for our purposes we can simply use universal hashing, since it is an average-case strong extractor that achieves the optimal [RTS00] entropy loss of $2\log\left(\frac{1}{\epsilon}\right)$. In particular, using Lemma 2.4, we obtain our main corollary:

Lemma 4.3.

If $(\mathsf{SS},\mathsf{Rec})$ is an $(\mathcal{M}, m, \tilde{m}, t)$-secure sketch and $\mathsf{Ext}$ is an $(n, \tilde{m}, \ell, \epsilon)$-strong extractor given by universal hashing (in particular, any $\ell \leq \tilde{m} - 2\log\left(\frac{1}{\epsilon}\right) + 2$ can be achieved), then the above construction $(\mathsf{Gen},\mathsf{Rep})$ is an $(\mathcal{M}, m, \ell, t, \epsilon)$-fuzzy extractor. In particular, one can extract up to $\left(\tilde{m} - 2\log\left(\frac{1}{\epsilon}\right) + 2\right)$ nearly uniform bits from a secure sketch with residual min-entropy $\tilde{m}$.

Again, if the above secure sketch is average-case secure, then so is the resulting fuzzy extractor. In fact, combining the above result with Lemma 3.1, we get the following general construction of average-case fuzzy extractors:

Lemma 4.4.

Assume some algorithms $\mathsf{SS}$ and $\mathsf{Rec}$ satisfy the correctness property of a secure sketch for some value of $t$, and that the output range of $\mathsf{SS}$ has size at most $2^{\lambda}$ (this holds, in particular, if the length of the sketch is bounded by $\lambda$). Then, for any min-entropy threshold $m$, there exists an average-case $(\mathcal{M}, m, m - \lambda - 2\log\left(\frac{1}{\epsilon}\right) + 2, t, \epsilon)$-fuzzy extractor for $\mathcal{M}$. In particular, for any $m$, the entropy loss of the fuzzy extractor is at most $\lambda + 2\log\left(\frac{1}{\epsilon}\right) - 2$.
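To make the parameter trade-off in Lemma 4.4 concrete, the following small Python helper evaluates the bound $\ell \leq m - \lambda - 2\log(1/\epsilon) + 2$. The function name and the sample numbers ($m = 200$, $\lambda = 60$, $\epsilon = 2^{-40}$) are hypothetical and chosen only for illustration.

```python
import math

def max_extractable_bits(m, lam, eps):
    """Largest integer ell with ell <= m - lam - 2*log2(1/eps) + 2."""
    return math.floor(m - lam - 2 * math.log2(1 / eps) + 2)

m, lam, eps = 200, 60, 2.0 ** -40   # illustrative values, not from the text
ell = max_extractable_bits(m, lam, eps)
entropy_loss = m - ell               # equals lam + 2*log2(1/eps) - 2 here
print(ell, entropy_loss)
```

With these numbers, 60 bits are lost to the sketch and a further 78 bits to the extractor, so a 200-bit min-entropy source yields a 62-bit key.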

4.2 Secure Sketches for Transitive Metric Spaces

We give a general technique for building secure sketches in transitive metric spaces, which we now define. A permutation $\pi$ on a metric space $\mathcal{M}$ is an isometry if it preserves distances, i.e., $\mathsf{dis}(a,b) = \mathsf{dis}(\pi(a),\pi(b))$. A family of permutations $\Pi = \{\pi_i\}_{i \in \mathcal{I}}$ acts transitively on $\mathcal{M}$ if for any two elements $a, b \in \mathcal{M}$, there exists $\pi_i \in \Pi$ such that $\pi_i(a) = b$. Suppose we have a family $\Pi$ of transitive isometries for $\mathcal{M}$ (we will call such $\mathcal{M}$ transitive). For example, in the Hamming space, the set of all shifts $\pi_x(w) = w \oplus x$ is such a family (see Section 5 for more details on this example).

Construction 1 (Secure Sketch For Transitive Metric Spaces).

Let $C$ be an $(\mathcal{M}, K, t)$-code. Then the general sketching scheme $\mathsf{SS}$ is the following: given an input $w \in \mathcal{M}$, pick uniformly at random a codeword $b \in C$, pick uniformly at random a permutation $\pi \in \Pi$ such that $\pi(w) = b$, and output $\mathsf{SS}(w) = \pi$ (it is crucial that each $\pi \in \Pi$ should have a canonical description that is independent of how $\pi$ was chosen and, in particular, independent of $b$ and $w$; the number of possible outputs of $\mathsf{SS}$ should thus be $|\Pi|$). The recovery procedure $\mathsf{Rec}$ to find $w$ given $w'$ and the sketch $\pi$ is as follows: find the closest codeword $b'$ to $\pi(w')$, and output $\pi^{-1}(b')$.

Let $\Gamma$ be a number such that $\min_{w,b} |\{\pi \in \Pi \mid \pi(w) = b\}| \geq \Gamma$; i.e., for each $w$ and $b$, there are at least $\Gamma$ choices for $\pi$. Then we obtain the following lemma.

Lemma 4.5.

$(\mathsf{SS},\mathsf{Rec})$ is an average-case $(\mathcal{M}, m, m - \log|\Pi| + \log\Gamma + \log K, t)$-secure sketch. It is efficient if operations on the code, as well as $\pi$ and $\pi^{-1}$, can be implemented efficiently.

Proof.

Correctness is clear: when $\mathsf{dis}(w,w') \leq t$, then $\mathsf{dis}(b,\pi(w')) \leq t$ (because $\pi$ is an isometry and $\pi(w) = b$), so decoding $\pi(w')$ will result in $b' = b$, which in turn means that $\pi^{-1}(b') = w$. The intuitive argument for security is as follows: we add $\log K + \log\Gamma$ bits of entropy by choosing $b$ and $\pi$, and subtract $\log|\Pi|$ by publishing $\pi$. Since given $\pi$, $w$ and $b$ determine each other, the total entropy loss is $\log|\Pi| - \log K - \log\Gamma$. More formally, $\tilde{\mathbf{H}}_\infty(W \mid \mathsf{SS}(W), I) \geq \tilde{\mathbf{H}}_\infty((W,\mathsf{SS}(W)) \mid I) - \log|\Pi|$ by Lemma 2.2(b). Given a particular value of $w$, there are $K$ equiprobable choices for $b$ and, further, at least $\Gamma$ equiprobable choices for $\pi$ once $b$ is picked, and hence any given permutation $\pi$ is chosen with probability at most $1/(K\Gamma)$ (because different choices for $b$ result in different choices for $\pi$). Therefore, for all $i$, $w$, and $\pi$, $\Pr[W = w \wedge \mathsf{SS}(w) = \pi \mid I = i] \leq \Pr[W = w \mid I = i]/(K\Gamma)$; hence $\tilde{\mathbf{H}}_\infty((W,\mathsf{SS}(W)) \mid I) \geq \tilde{\mathbf{H}}_\infty(W \mid I) + \log K + \log\Gamma$. ∎

Naturally, security loss will be smaller if the code $C$ is denser.

We will discuss concrete instantiations of this approach in Section 5 and Section 6.1.
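As a hedged illustration of Construction 1, the following Python sketch instantiates it in the Hamming space $\{0,1\}^7$ with the family of shifts $\pi_x(w) = w \oplus x$. The choice of code is an assumption made for the example: a $[7,4]$ Hamming code in systematic form, which has $K = 16$ codewords and corrects $t = 1$ error. Here $|\Pi| = 2^7$ and $\Gamma = 1$ (exactly one shift maps $w$ to $b$), so Lemma 4.5 gives residual entropy $m - 7 + 0 + 4 = m - 3$. Decoding is done by brute-force search, which is only reasonable at this toy size.

```python
import secrets
from itertools import product

# Generator matrix of a [7,4] Hamming code in systematic form (assumed example code).
G = [
    (1, 0, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]

def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

def dist(a, b):
    return sum(x != y for x, y in zip(a, b))

# Enumerate all K = 16 codewords as GF(2) combinations of the rows of G.
CODE = []
for msg in product((0, 1), repeat=4):
    cw = (0,) * 7
    for bit, row in zip(msg, G):
        if bit:
            cw = xor(cw, row)
    CODE.append(cw)

def ss(w):
    """Pick a random codeword b and publish the shift x with pi_x(w) = b."""
    b = secrets.choice(CODE)
    return xor(w, b)

def rec(w_prime, x):
    """Decode pi_x(w_prime) to the nearest codeword b', then apply pi_x^{-1}."""
    shifted = xor(w_prime, x)
    b_prime = min(CODE, key=lambda c: dist(c, shifted))
    return xor(b_prime, x)             # shifts are their own inverses

w = (1, 0, 1, 1, 0, 0, 1)
x = ss(w)
w_noisy = (1, 0, 1, 0, 0, 0, 1)        # one bit flipped
print(rec(w_noisy, x) == w)
```

The shift $x$ is its own canonical description, which matches the requirement in Construction 1 that the published permutation not depend on how it was chosen.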

4.3 Changing Metric Spaces via Biometric Embeddings

We now introduce a general technique that allows one to build fuzzy extractors and secure sketches in some metric space $\mathcal{M}_1$ from fuzzy extractors and secure sketches in some other metric space $\mathcal{M}_2$. Below, we let $\mathsf{dis}(\cdot,\cdot)_i$ denote the distance function in $\mathcal{M}_i$. The technique is to embed $\mathcal{M}_1$ into $\mathcal{M}_2$ so as to “preserve” relevant parameters for fuzzy extraction.

Definition 6.

A function $f : \mathcal{M}_1 \to \mathcal{M}_2$ is called a $(t_1, t_2, m_1, m_2)$-biometric embedding if the following two conditions hold:

  • for any $w_1, w_1' \in \mathcal{M}_1$ such that $\mathsf{dis}(w_1, w_1')_1 \leq t_1$, we have $\mathsf{dis}(f(w_1), f(w_1'))_2 \leq t_2$.

  • for any distribution $W_1$ on $\mathcal{M}_1$ of min-entropy at least $m_1$, $f(W_1)$ has min-entropy at least $m_2$.

The following lemma is immediate (correctness of the resulting fuzzy extractor follows from the first condition, and security follows from the second):

Lemma 4.6.

If $f$ is a $(t_1, t_2, m_1, m_2)$-biometric embedding of $\mathcal{M}_1$ into $\mathcal{M}_2$ and $(\mathsf{Gen}(\cdot), \mathsf{Rep}(\cdot,\cdot))$ is an $(\mathcal{M}_2, m_2, \ell, t_2, \epsilon)$-fuzzy extractor, then $(\mathsf{Gen}(f(\cdot)), \mathsf{Rep}(f(\cdot),\cdot))$ is an $(\mathcal{M}_1, m_1, \ell, t_1, \epsilon)$-fuzzy extractor.

It is easy to define average-case biometric embeddings (in which $\tilde{\mathbf{H}}_\infty(W_1 \mid I) \geq m_1 \Rightarrow \tilde{\mathbf{H}}_\infty(f(W_1) \mid I) \geq m_2$), which would result in an analogous lemma for average-case fuzzy extractors.
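The composition in Lemma 4.6 amounts to applying $f$ before $\mathsf{Gen}$ and before $\mathsf{Rep}$. The following Python sketch shows that wrapper together with one simple embedding, chosen here as an assumption for illustration and not taken from the text: the unary ("thermometer") encoding of integers $0, \ldots, 16$ under the metric $|a - b|$ into the Hamming space $\{0,1\}^{16}$. Because this map preserves distances exactly and is injective, it is a $(t, t, m, m)$-biometric embedding for every $t$ and $m$; the code verifies both facts exhaustively.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def thermometer(w, size):
    """Unary encoding: integer w in [0, size] becomes w ones followed by zeros."""
    return tuple(1 if i < w else 0 for i in range(size))

def embed(f, gen, rep):
    """Lift a fuzzy extractor (gen, rep) on M2 to M1 through the embedding f."""
    def gen1(w):
        return gen(f(w))
    def rep1(w_prime, p):
        return rep(f(w_prime), p)
    return gen1, rep1

SIZE = 16  # hypothetical size of the integer range

def f(w):
    return thermometer(w, SIZE)

points = range(SIZE + 1)
# First condition of Definition 6: distances are preserved, so t2 = t1 works.
isometric = all(hamming(f(a), f(b)) == abs(a - b) for a in points for b in points)
# Second condition: an injective map cannot lower min-entropy, so m2 = m1 works.
injective = len({f(a) for a in points}) == SIZE + 1
print(isometric, injective)
```

Any Hamming-space fuzzy extractor on 16-bit strings could then be passed to `embed` to obtain one for this integer range; embeddings that distort distances or merge points would require $t_2 > t_1$ or $m_2 < m_1$ accordingly.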

For a similar result to hold for secure sketches, we need biometric embeddings with an additional property.

Definition 7.

A function $f : \mathcal{M}_1 \to \mathcal{M}_2$ is called a $(t_1, t_2, \lambda)$-biometric embedding with recovery information $g$ if:

  • for any $w_1, w_1' \in \mathcal{M}_1$ such that $\mathsf{dis}(w_1, w_1')_1 \leq t_1$, we have $\mathsf{dis}(f(w_1), f(w_1'))_2 \leq t_2$.

  • $g : \mathcal{M}_1 \to \{0,1\}^*$ is a function with range size at most $2^{\lambda}$, and $w_1 \in \mathcal{M}_1$ is uniquely determined by $(f(w_1), g(w_1))$.

With this definition, we get the following analog of Lemma 4.6.

Lemma 4.7.

Let $f$ be a $(t_1, t_2, \lambda)$-biometric embedding with recovery information $g$. Let $(\mathsf{SS},\mathsf{Rec})$ be an $(\mathcal{M}_2, m_1 - \lambda, \tilde{m}_2, t_2)$ average-case secure sketch. Let $\mathsf{SS}'(w) = (\mathsf{SS}(f(w)), g(w))$. Let $\mathsf{Rec}'(w', (s,r))$ be the function obtained by computing $\mathsf{Rec}(f(w'), s)$ to get $f(w)$ and then inverting $(f(w), r)$ to get $w$.
Then $(\mathsf{SS}',\mathsf{Rec}')$ is an $(\mathcal{M}_1, m_1, \tilde{m}_2, t_1)$ average-case secure sketch.

Proof.

The correctness of this construction follows immediately from the two properties given in Definition 7. As for security, using Lemma 2.2(b) and the fact that the range of $g$ has size at most $2^{\lambda}$, we get that $\tilde{\mathbf{H}}_\infty(W \mid g(W)) \geq m_1 - \lambda$ whenever $\mathbf{H}_\infty(W) \geq m_1$. Moreover, since $W$ is uniquely recoverable from $f(W)$ and $g(W)$, it follows that $\tilde{\mathbf{H}}_\infty(f(W) \mid g(W)) \geq m_1 - \lambda$ as well, whenever $\mathbf{H}_\infty(W) \geq m_1$.
Using the fact that $(\mathsf{SS}, \mathsf{Rec})$ is an average-case $(\mathcal{M}_2, m_1 - \lambda, \tilde{m}_2, t_2)$ secure sketch, we get that $\tilde{\mathbf{H}}_\infty(f(W) \mid (\mathsf{SS}(f(W)), g(W))) = \tilde{\mathbf{H}}_\infty(f(W) \mid \mathsf{SS}'(W)) \geq \tilde{m}_2$. Finally, since the application of $f$ can only reduce min-entropy, $\tilde{\mathbf{H}}_\infty(W \mid \mathsf{SS}'(W)) \geq \tilde{m}_2$ whenever $\mathbf{H}_\infty(W) \geq m_1$. ∎

As we saw, the proof above critically used the notion of average-case secure sketches. Luckily, all our constructions (for example, those obtained via Lemma 3.1) are average-case, so this subtlety will not matter too much.

We will see the utility of this novel type of embedding in Section 7.

5 Constructions for Hamming Distance

In this section we consider constructions for the space $\mathcal{M} = \mathcal{F}^n$ under the Hamming distance metric. Let $F = |\mathcal{F}|$ and $f = \log_2 F$.

Secure Sketches: The Code-Offset Construction.  For the case of $\mathcal{F} = \{0,1\}$, Juels and Wattenberg [JW99] considered a notion of “fuzzy commitment.” [Footnote 8: In their interpretation, one commits to $x$ by picking a random $w$ and publishing $\mathsf{SS}(w; x)$.] Given an $[n, k, 2t+1]_2$ error-correcting code $C$ (not necessarily linear), they fuzzy-commit to $x$ by publishing $w \oplus C(x)$. Their construction can be rephrased in our language to give a very simple construction of secure sketches for general $\mathcal{F}$.

We start with an $[n, k, 2t+1]_{\mathcal{F}}$ error-correcting code $C$ (not necessarily linear). The idea is to use $C$ to correct errors in $w$ even though $w$ may not be in $C$. This is accomplished by shifting the code so that a codeword matches up with $w$, and storing the shift as the sketch. To do so, we need to view $\mathcal{F}$ as an additive cyclic group of order $F$ (in the case of most common error-correcting codes, $\mathcal{F}$ will anyway be a field).

Construction 2 (Code-Offset Construction).

On input $w$, select a random codeword $c$ (this is equivalent to choosing a random $x \in \mathcal{F}^k$ and computing $C(x)$), and set $\mathsf{SS}(w)$ to be the shift needed to get from $c$ to $w$: $\mathsf{SS}(w) = w - c$. Then $\mathsf{Rec}(w', s)$ is computed by subtracting the shift $s$ from $w'$ to get $c' = w' - s$; decoding $c'$ to get $c$ (note that because $\mathsf{dis}(w', w) \leq t$, so is $\mathsf{dis}(c', c)$); and computing $w$ by shifting back to get $w = c + s$.
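The code-offset steps can be sketched in a few lines. The following is a minimal illustration, not the construction's reference implementation: it assumes the alphabet is the cyclic group $\mathbb{Z}_5$ and uses a length-5 repetition code (a $[5,1,5]$ code, so $t = 2$) with majority decoding. The names `encode`, `decode`, `sketch`, and `recover` are hypothetical.

```python
import secrets
from collections import Counter

F = 5                # alphabet size; symbols live in the additive group Z_F (assumed)
N = 5                # code length; the repetition code has distance N = 2t + 1
T = (N - 1) // 2     # number of symbol errors tolerated (here 2)

def encode(x):
    """Repetition code C: Z_F -> Z_F^N, an [N, 1, N] code."""
    return [x] * N

def decode(c_prime):
    """Majority decoding: returns the nearest codeword if at most T symbols are wrong."""
    return encode(Counter(c_prime).most_common(1)[0][0])

def sketch(w):
    """SS(w) = w - c for a uniformly random codeword c."""
    c = encode(secrets.randbelow(F))
    return [(wi - ci) % F for wi, ci in zip(w, c)]

def recover(w_prime, s):
    """Rec(w', s): subtract the shift, decode to c, then shift back."""
    c_prime = [(wi - si) % F for wi, si in zip(w_prime, s)]
    c = decode(c_prime)
    return [(ci + si) % F for ci, si in zip(c, s)]

w = [3, 1, 4, 1, 0]
s = sketch(w)
w_noisy = [3, 2, 4, 1, 3]          # two symbols changed, within T = 2
assert recover(w_noisy, s) == w
```

A repetition code is chosen only because its decoder fits in one line; any code with an efficient decoder could be substituted for `encode` and `decode`.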


In the case of $\mathcal{F} = \{0,1\}$, addition and subtraction are the same, and we get that computation of the sketch is the same as the Juels-Wattenberg commitment: $\mathsf{SS}(w) = w \oplus C(x)$. In this case, to recover $w$ given $w'$ and $s = \mathsf{SS}(w)$, compute $c' = w' \oplus s$, decode $c'$ to get $c$, and compute $w = c \oplus s$.

When the code C𝐶Citalic_C is linear, this scheme can be simplified as follows.

Construction 3 (Syndrome Construction).

Set $\mathsf{SS}(w) = \mathsf{syn}(w)$. To compute $\mathsf{Rec}(w', s)$, find the unique vector $e \in \mathcal{F}^n$ of Hamming weight $\leq t$ such that $\mathsf{syn}(e) = \mathsf{syn}(w') - s$, and output $w = w' - e$.

As explained in Section 2, finding the short error-vector $e$ from its syndrome is the same as decoding the code. It is easy to see that the two constructions above are equivalent: given $\mathsf{syn}(w)$ one can sample from $w - c$ by choosing a random string $v$ with $\mathsf{syn}(v) = \mathsf{syn}(w)$; conversely, $\mathsf{syn}(w - c) = \mathsf{syn}(w)$. To show that $\mathsf{Rec}$ finds the correct $w$, observe that $\mathsf{dis}(w' - e, w') \leq t$ by the constraint on the weight of $e$, and $\mathsf{syn}(w' - e) = \mathsf{syn}(w') - \mathsf{syn}(e) = \mathsf{syn}(w') - (\mathsf{syn}(w') - s) = s$. There can be only one value within distance $t$ of $w'$ whose syndrome is $s$ (else by subtracting two such values we get a codeword that is closer than $2t+1$ to 0, but 0 is also a codeword), so $w' - e$ must be equal to $w$.
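To make the syndrome construction concrete, here is a small sketch under stated assumptions: it uses the binary $[7,4,3]$ Hamming code (so $t = 1$ and the sketch is $n - k = 3$ bits), with the parity-check matrix whose $j$-th column is the binary expansion of $j$. The function names are hypothetical, and the decoder works only for this particular code.

```python
# Parity-check matrix of the binary [7,4,3] Hamming code: column j (1-based)
# is the binary expansion of j, so a single error at position j has a
# syndrome that spells out j in binary.
H = [[(j >> b) & 1 for j in range(1, 8)] for b in range(3)]

def syn(v):
    """Syndrome H * v over GF(2), returned as a tuple of 3 bits."""
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)

def sketch(w):
    """SS(w) = syn(w): deterministic and n - k = 3 bits long."""
    return syn(w)

def recover(w_prime, s):
    """Find e of weight <= 1 with syn(e) = syn(w') - s, then output w' - e."""
    target = tuple((a - b) % 2 for a, b in zip(syn(w_prime), s))
    pos = target[0] + 2 * target[1] + 4 * target[2]   # 0 means "no error"
    e = [0] * 7
    if pos:
        e[pos - 1] = 1
    return [(x - y) % 2 for x, y in zip(w_prime, e)]

w = [1, 0, 1, 1, 0, 0, 1]
s = sketch(w)
w_noisy = [1, 0, 1, 0, 0, 0, 1]    # bit 4 flipped
assert recover(w_noisy, s) == w
```

Note that `recover` never reconstructs a codeword explicitly; it only solves for the error vector, which is the simplification the syndrome form offers for linear codes.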

As mentioned in the introduction, the syndrome construction has appeared before as a component of some cryptographic protocols over quantum and other noisy channels [BBCS91, Cré97], though it has not been analyzed the same way.

Both schemes are $(\mathcal{F}^n, m, m - (n-k)f, t)$ secure sketches. For the randomized scheme, the intuition for understanding the entropy loss is as follows: we add $k$ random elements of $\mathcal{F}$ and publish $n$ elements of $\mathcal{F}$. The formal proof is simply Lemma 4.5, because addition in $\mathcal{F}^n$ is a family of transitive isometries. For the syndrome scheme, this follows from Lemma 3.1, because the syndrome is $(n-k)$ elements of $\mathcal{F}$.

We thus obtain the following theorem.

Theorem 5.1.

Given an $[n, k, 2t+1]_{\mathcal{F}}$ error-correcting code, one can construct an average-case $(\mathcal{F}^n, m, m - (n-k)f, t)$ secure sketch, which is efficient if encoding and decoding are efficient. Furthermore, if the code is linear, then the sketch is deterministic and its output is $(n-k)$ symbols long.

In Appendix C we present some generic lower bounds on secure sketches and fuzzy extractors. Recall that $A_F(n, d)$ denotes the maximum number $K$ of codewords possible in a code of distance $d$ over $n$-character words from an alphabet of size $F$. Then by Lemma C.1, we obtain that the entropy loss of a secure sketch for the Hamming metric is at least $nf - \log_2 A_F(n, 2t+1)$ when the input is uniform (that is, when $m = nf$), because $K(\mathcal{M}, t)$ from Lemma C.1 is in this case equal to $A_F(n, 2t+1)$ (since a code that corrects $t$ Hamming errors must have minimum distance at least $2t+1$). This means that if the underlying code is optimal (i.e., $K = A_F(n, 2t+1)$), then the code-offset construction above is optimal for the case of uniform inputs, because its entropy loss is $nf - (\log_F K)(\log_2 F) = nf - \log_2 K$.
Of course, we do not know the exact value of $A_F(n, d)$, let alone efficiently decodable codes which meet the bound, for many settings of $F$, $n$ and $d$. Nonetheless, the code-offset scheme gets as close to optimality as is possible from coding constraints. If better efficient codes are invented, then better (i.e., lower loss or higher error-tolerance) secure sketches will result.
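As a concrete sanity check (an illustration added here, not an example from the text), consider the binary $[7,4,3]$ Hamming code, so that $F = 2$, $f = 1$, and $t = 1$. The sphere-packing bound gives $A_2(7,3) \leq 2^7/(1+7) = 16$, which the Hamming code attains, so the lower bound and the entropy loss of the code-offset sketch coincide:

```latex
\begin{align*}
\text{lower bound:}\quad & nf - \log_2 A_2(7,3) = 7 - \log_2 16 = 3 \text{ bits},\\
\text{code-offset loss:}\quad & (n-k)f = (7-4)\cdot 1 = 3 \text{ bits}.
\end{align*}
```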

Fuzzy Extractors.  As a warm-up, consider the case when $W$ is uniform ($m = nf$) and look at the code-offset sketch construction: $v = w - C(x)$. For $\mathsf{Gen}(w)$, output $R = x$, $P = v$. For $\mathsf{Rep}(w', P)$, decode $w' - P$ to obtain $C(x)$ and apply $C^{-1}$ to obtain $x$. The result, quite clearly, is an $(\mathcal{F}^n, nf, kf, t, 0)$-fuzzy extractor, since $v$ is truly random and independent of $x$ when $w$ is random. In fact, this is exactly the usage proposed by Juels and Wattenberg [JW99], except they viewed the above fuzzy extractor as a way to use $w$ to “fuzzy commit” to $x$, without revealing information about $x$.
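The warm-up extractor can be sketched as follows. This is only an illustration for uniform binary inputs, assuming a toy code in which each of $k = 4$ message bits is repeated three times (a $[12,4,3]$ code, so $t = 1$). The names `gen` and `rep` mirror $\mathsf{Gen}$ and $\mathsf{Rep}$, and the remaining helpers are hypothetical.

```python
import secrets

K = 4                 # message length in bits (illustrative)
REP = 3               # each bit is repeated 3 times: a [12, 4, 3] code, t = 1
N = K * REP

def encode(x):
    """C(x): repeat every message bit REP times."""
    return [bit for bit in x for _ in range(REP)]

def decode_message(c_prime):
    """Blockwise majority vote; returns x, i.e. C^{-1} of the nearest codeword."""
    return [int(sum(c_prime[i:i + REP]) > REP // 2) for i in range(0, N, REP)]

def gen(w):
    """Gen(w): choose random x, publish P = w xor C(x), output R = x."""
    x = [secrets.randbelow(2) for _ in range(K)]
    p = [wi ^ ci for wi, ci in zip(w, encode(x))]
    return x, p

def rep(w_prime, p):
    """Rep(w', P): decode w' xor P and invert the encoding."""
    return decode_message([wi ^ pi for wi, pi in zip(w_prime, p)])

w = [secrets.randbelow(2) for _ in range(N)]   # uniform input, as the warm-up requires
r, p = gen(w)
w_noisy = list(w)
w_noisy[5] ^= 1                                # one bit of noise
assert rep(w_noisy, p) == r
```

The sketch inherits the limitation of the warm-up: setting $R = x$ is justified only when the input is uniform.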

Unfortunately, the above construction setting $R = x$ works only for uniform $W$, since otherwise $v$ would leak information about $x$.

In general, we use the construction in Lemma 4.3 combined with Theorem 5.1 to obtain the following theorem.

Theorem 5.2.

Given any $[n, k, 2t+1]_{\mathcal{F}}$ code $C$ and any $m, \epsilon$, there exists an average-case $(\mathcal{M}, m, \ell, t, \epsilon)$-fuzzy extractor, where $\ell = m + kf - nf - 2\log\left(\frac{1}{\epsilon}\right) + 2$. The generation $\mathsf{Gen}$ and recovery $\mathsf{Rep}$ are efficient if $C$ has efficient encoding and decoding.
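To get a feel for the output length in Theorem 5.2, the formula for $\ell$ can be evaluated directly. The helper below is a hypothetical convenience function, and the parameter values are purely illustrative rather than taken from the paper.

```python
import math

def extracted_length(m, n, k, f, eps):
    """ell = m + k*f - n*f - 2*log2(1/eps) + 2, the expression in Theorem 5.2."""
    return m + k * f - n * f - 2 * math.log2(1 / eps) + 2

# Illustrative (hypothetical) parameters: a binary code with n = 255 and k = 131,
# an input with m = 200 bits of min-entropy, and eps = 2**-20.
ell = extracted_length(m=200, n=255, k=131, f=1, eps=2**-20)
print(ell)   # 200 + 131 - 255 - 40 + 2 = 38.0
```

The example shows how quickly the redundancy $(n-k)f$ and the $2\log(1/\epsilon)$ term consume the available min-entropy.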

6 Constructions for Set Difference

We now turn to inputs that are subsets of a universe $\mathcal{U}$; let $n = |\mathcal{U}|$. This corresponds to representing an object by a list of its features. Examples include “minutiae” (ridge meetings and endings) in a fingerprint, short strings which occur in a long document, or lists of favorite movies.

Recall that the distance between two sets $w, w'$ is the size of their symmetric difference: $\mathsf{dis}(w, w') = |w \triangle w'|$. We will denote this metric space by $\mathsf{SDif}(\mathcal{U})$. A set $w$ can be viewed as its characteristic vector in $\{0,1\}^n$, with $1$ at position $x \in \mathcal{U}$ if $x \in w$, and $0$ otherwise. Such a representation of sets makes set difference the same as the Hamming metric. However, we will mostly focus on settings where $n$ is much larger than the size of $w$, so that representing a set $w$ by $n$ bits is much less efficient than, say, writing down a list of elements in $w$, which requires only $|w| \log n$ bits.
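The correspondence between set difference and Hamming distance is easy to check on a toy universe. The snippet below is a minimal sketch that assumes $\mathcal{U} = \{0, \dots, n-1\}$; all function names are hypothetical.

```python
def char_vector(w, n):
    """Characteristic vector of w, a subset of {0, ..., n-1}, as a list of n bits."""
    return [1 if x in w else 0 for x in range(n)]

def set_distance(w, w_prime):
    """dis(w, w') = size of the symmetric difference."""
    return len(w ^ w_prime)

def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

n = 16
w, w_prime = {1, 5, 9, 12}, {1, 5, 10, 12, 14}
assert set_distance(w, w_prime) == 3
assert hamming(char_vector(w, n), char_vector(w_prime, n)) == 3
```

The vectors have length $n$ while the sets have only a handful of elements, which is the inefficiency the large-universe constructions aim to avoid.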

Large Versus Small Universes.  More specifically, we will distinguish two broad categories of settings. Let $s$ denote the size of the sets that are given as inputs to the secure sketch (or fuzzy extractor) algorithms. Most of this section studies situations where the universe size $n$ is superpolynomial in the set size $s$. We call this the “large universe” setting. In contrast, the “small universe” setting refers to situations in which $n = \mathit{poly}(s)$. We want our various constructions to run in polynomial time and use polynomial storage space. In the large universe setting, the $n$-bit string representation of a set becomes too large to be usable; we will strive for solutions that are polynomial in $s$ and $\log n$.

In fact, in many applications—for example, when the input is a list of book titles—it is possible that the actual universe is not only large, but also difficult to enumerate, making it difficult to even find the position in the characteristic vector corresponding to xw𝑥𝑤x\in witalic_x ∈ italic_w. In that case, it is natural to enlarge the universe to a well-understood class—for example, to include all possible strings of a certain length, whether or not they are actual book titles. This has the advantage that the position of x𝑥xitalic_x in the characteristic vector is simply x𝑥xitalic_x itself; however, because the universe is now even larger, the dependence of running time on n𝑛nitalic_n becomes even more important.

Fixed versus Flexible Set Size.  In some situations, all objects are represented by feature sets of exactly the same size $s$, while in others the sets may be of arbitrary size. In particular, the original set $w$ and the corrupted set $w'$ from which we would like to recover the original need not be of the same size. We refer to these two settings as fixed and flexible set size, respectively. When the set size is fixed, the distance $\mathsf{dis}(w, w')$ is always even: $\mathsf{dis}(w, w') = t$ if and only if $w$ and $w'$ agree on exactly $s - \frac{t}{2}$ points. We will denote the restriction of $\mathsf{SDif}(\mathcal{U})$ to $s$-element subsets by $\mathsf{SDif}_s(\mathcal{U})$.
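The parity claim for fixed set size follows from a one-line count. The derivation below is a short worked step that uses only $|w| = |w'| = s$:

```latex
\[
\mathsf{dis}(w,w') = |w \triangle w'| = |w| + |w'| - 2\,|w \cap w'| = 2\bigl(s - |w \cap w'|\bigr),
\qquad\text{so}\qquad
\mathsf{dis}(w,w') = t \iff |w \cap w'| = s - \tfrac{t}{2}.
\]
```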

Summary.  As a point of reference, we will see below that $\log\binom{n}{s} - \log A(n, 2t+1, s)$ is a lower bound on the entropy loss of any secure sketch for set difference (whether or not the set size is fixed). Recall that $A(n, 2t+1, s)$ represents the size of the largest code for Hamming space with minimum distance $2t+1$, in which every word has weight exactly $s$. In the large universe setting, where $t \ll n$, the lower bound is approximately $t \log n$. The relevant lower bounds are discussed at the end of Sections 6.1 and 6.2.

In the following sections we will present several schemes which meet this lower bound. The setting of small universes is discussed in Section 6.1. We discuss the code-offset construction (from Section 5), as well as a permutation-based scheme which is tailored to fixed set size. The latter scheme is optimal for this metric, but impractical.

In the remainder of the section, we discuss schemes for the large universe setting. In Section 6.2 we give an improved version of the scheme of Juels and Sudan [JS06]. Our version achieves optimal entropy loss and storage $t \log n$ for fixed set size (notice the entropy loss doesn’t depend on the set size $s$, although the running time does). The new scheme provides an exponential improvement over the original parameters (which are analyzed in Appendix D). Finally, in Section 6.3 we describe how to adapt syndrome decoding algorithms for BCH codes to our application. The resulting scheme, called PinSketch, has optimal storage and entropy loss $t \log(n+1)$, handles flexible set sizes, and is probably the most practical of the schemes presented here. Another scheme achieving similar parameters (but less efficiently) can be adapted from information reconciliation literature [MTZ03]; see Section 9 for more details.

We do not discuss fuzzy extractors beyond mentioning here that each secure sketch presented in this section can be converted to a fuzzy extractor using Lemma 4.3. We have already seen an example of such conversion in Section 5.

Table 1 summarizes the constructions discussed in this section.

| Scheme | Entropy Loss | Storage | Time | Set Size | Notes |
|---|---|---|---|---|---|
| Juels-Sudan [JS06] | $t \log n + \log\left(\binom{n}{r} / \binom{n-s}{r-s}\right) + 2$ | $r \log n$ | $\mathit{poly}(r \log n)$ | Fixed | $r$ is a parameter, $s \leq r \leq n$ |
| Generic syndrome | $n - \log A(n, 2t+1)$ | $n - \log A(n, 2t+1)$ (for linear codes) | $\mathit{poly}(n)$ | Flexible | ent. loss $\approx t \log n$ when $t \ll n$ |
| Permutation-based | $\log\binom{n}{s} - \log A(n, 2t+1, s)$ | $O(n \log n)$ | $\mathit{poly}(n)$ | Fixed | ent. loss $\approx t \log n$ when $t \ll n$ |
| Improved JS | $t \log n$ | $t \log n$ | $\mathit{poly}(s \log n)$ | Fixed | |
| PinSketch | $t \log(n+1)$ | $t \log(n+1)$ | $\mathit{poly}(s \log n)$ | Flexible | See Section 6.3 for running time |

Table 1: Summary of Secure Sketches for Set Difference.

6.1 Small Universes

When the universe size is polynomial in $s$, there are a number of natural constructions. The most direct one, given previous work, is the construction of Juels and Sudan [JS06]. Unfortunately, that scheme requires a fixed set size and achieves relatively poor parameters (see Appendix D).

We suggest two possible constructions. The first involves representing sets as $n$-bit strings and using the constructions of Section 5. The second construction, presented below, requires a fixed set size but achieves slightly improved parameters by going through “constant-weight” codes.

Permutation-based Sketch.  Recall the general construction of Section 4.2 for transitive metric spaces. Let $\Pi$ be the set of all permutations on $\mathcal{U}$. Given $\pi \in \Pi$, make it a permutation on $\mathsf{SDif}_s(\mathcal{U})$ naturally: $\pi(w) = \{\pi(x) \mid x \in w\}$. This makes $\Pi$ into a family of transitive isometries on $\mathsf{SDif}_s(\mathcal{U})$, and thus the results of Section 4.2 apply.

Let $C \subseteq \{0,1\}^n$ be any $[n, k, 2t+1]$ binary code in which all words have weight exactly $s$. Such codes have been studied extensively (see, e.g., [AVZ00, BSSS90] for a summary of known upper and lower bounds). View elements of the code as sets of size $s$. We obtain the following scheme, which produces a sketch of length $O(n \log n)$.

Construction 4 (Permutation-Based Sketch).

On input $w \subseteq \mathcal{U}$ of size $s$, choose $b \subseteq \mathcal{U}$ at random from the code $C$, and choose a random permutation $\pi : \mathcal{U} \to \mathcal{U}$ such that $\pi(w) = b$ (that is, choose a random matching between $w$ and $b$ and a random matching between $\mathcal{U} - w$ and $\mathcal{U} - b$). Output $\mathsf{SS}(w) = \pi$ (say, by listing $\pi(1), \dots, \pi(n)$). To recover $w$ from $w'$ such that $\mathsf{dis}(w, w') \leq t$ and $\pi$, compute $b' = \pi(w')$, decode the characteristic vector of $b'$ to obtain $b$, and output $w = \pi^{-1}(b)$.
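A toy version of the permutation-based sketch is given below. It is a sketch under stated assumptions rather than a practical scheme: the universe is $\{0, \dots, 15\}$, the constant-weight code consists of four disjoint blocks of size $s = 4$ (pairwise distance 8, so up to $t = 3$ errors are correctable), and decoding is a brute-force nearest-codeword search. It takes $\pi(w) = b$, so recovery applies $\pi$ to $w'$ and $\pi^{-1}$ to the decoded codeword. All names are hypothetical.

```python
import random

N = 16                                   # universe U = {0, ..., 15} (toy size)
S = 4                                    # fixed set size
# Toy constant-weight code: four disjoint blocks of size S. Any two codewords
# are at set-difference distance 2*S = 8, so t <= 3 errors can be corrected.
CODE = [frozenset(range(j * S, (j + 1) * S)) for j in range(N // S)]
U = set(range(N))
RNG = random.SystemRandom()              # OS-backed randomness source

def sketch(w):
    """Pick a random codeword b and a random permutation pi with pi(w) = b."""
    b = RNG.choice(CODE)
    pi = [None] * N
    for src_set, dst_set in ((w, b), (U - w, U - b)):
        src, dst = sorted(src_set), sorted(dst_set)
        RNG.shuffle(dst)                 # random matching between the two sets
        for x, y in zip(src, dst):
            pi[x] = y
    return pi                            # published as the list pi(0), ..., pi(N-1)

def recover(w_prime, pi):
    """Map w' through pi, decode to the nearest codeword b, and pull b back."""
    b_prime = {pi[x] for x in w_prime}
    b = min(CODE, key=lambda c: len(c ^ b_prime))   # brute-force decoding
    inverse = {y: x for x, y in enumerate(pi)}
    return {inverse[y] for y in b}

w = {2, 7, 11, 13}
pi = sketch(w)
w_noisy = {2, 7, 11, 5}                  # one element replaced: distance 2
assert recover(w_noisy, pi) == w
```

Storing `pi` takes $n$ entries of $\log n$ bits each, which matches the $O(n \log n)$ sketch length and shows why the scheme is limited to small universes.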

This construction is efficient as long as decoding is efficient (everything else takes time $O(n \log n)$). By Lemma 4.5, its entropy loss is $\log\binom{n}{s} - k$: here $|\Pi| = n!$ and $\Gamma = s!(n-s)!$, so $\log|\Pi| - \log\Gamma = \log\bigl(n!/(s!(n-s)!)\bigr) = \log\binom{n}{s}$.

Comparing the Hamming Scheme with the Permutation Scheme.  The code-offset construction was shown to have entropy loss $n - \log A(n, 2t+1)$ if an optimal code is used; the random permutation scheme has entropy loss $\log\binom{n}{s} - \log A(n, 2t+1, s)$ for an optimal code. The Bassalygo-Elias inequality (see [vL92]) shows that the bound on the random permutation scheme is always at least as good as the bound on the code offset scheme: $A(n, d) \cdot 2^{-n} \leq A(n, d, s) \cdot \binom{n}{s}^{-1}$. This implies that $n - \log A(n, d) \geq \log\binom{n}{s} - \log A(n, d, s)$. Moreover, standard packing arguments give better constructions of constant-weight codes than they do of ordinary codes. [Footnote 9: This comes from the fact that the intersection of a ball of radius $d$ with the set of all words of weight $s$ is much smaller than the ball of radius $d$ itself.] In fact, the random permutations scheme is optimal for this metric, just as the code-offset scheme is optimal for the Hamming metric.
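For a small numerical illustration of the Bassalygo-Elias inequality (values chosen here for convenience, with $d = 4$ rather than a distance of the form $2t+1$), take $n = 7$ and $s = 3$. The standard values $A(7,4) = 8$ and $A(7,4,3) = 7$, the latter attained by the seven lines of the Fano plane, give:

```latex
\[
A(7,4)\cdot 2^{-7} = \tfrac{8}{128} = \tfrac{1}{16}
\;\le\;
A(7,4,3)\cdot\binom{7}{3}^{-1} = \tfrac{7}{35} = \tfrac{1}{5},
\qquad
7 - \log_2 8 = 4 \;\ge\; \log_2 35 - \log_2 7 = \log_2 5 \approx 2.32 .
\]
```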

We show this as follows. Restrict $t$ to be even, because $\mathsf{dis}(w,w')$ is always even if $|w|=|w'|$. Then the minimum distance of a code over $\mathsf{SDif}_s(\mathcal{U})$ that corrects up to $t$ errors must be at least $2t+1$. Indeed, suppose not. Then take two codewords, $c_1$ and $c_2$, such that $\mathsf{dis}(c_1,c_2)\leq 2t$. There are $k$ elements in $c_1$ that are not in $c_2$ (call their set $c_1-c_2$) and $k$ elements in $c_2$ that are not in $c_1$ (call their set $c_2-c_1$), with $k\leq t$. Starting with $c_1$, remove $t/2$ elements of $c_1-c_2$ and add $t/2$ elements of $c_2-c_1$ to obtain a set $w$ (note that here we are using that $t$ is even; if $k<t/2$, then use $k$ elements). Then $\mathsf{dis}(c_1,w)\leq t$ and $\mathsf{dis}(c_2,w)\leq t$, and so if the received word is $w$, the receiver cannot be certain whether the sent word was $c_1$ or $c_2$ and hence cannot correct $t$ errors.
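The construction of the ambiguous word $w$ in this argument can be illustrated with a short Python sketch. The function names `set_distance` and `ambiguous_word` are hypothetical and introduced only for this illustration; sets of integers stand in for elements of the universe.

```python
def set_distance(a, b):
    """Set difference metric: size of the symmetric difference."""
    return len(a ^ b)

def ambiguous_word(c1, c2, t):
    """Build a set w within distance t of both c1 and c2.

    Assumes |c1| == |c2|, t is even, and dis(c1, c2) <= 2t,
    mirroring the hypotheses of the argument in the text.
    """
    assert t % 2 == 0 and len(c1) == len(c2)
    only1 = sorted(c1 - c2)   # elements of c1 not in c2
    only2 = sorted(c2 - c1)   # elements of c2 not in c1
    k = len(only1)
    assert k <= t
    j = min(t // 2, k)        # use k elements if k < t/2
    return (c1 - set(only1[:j])) | set(only2[:j])

c1 = {1, 2, 3, 4, 5, 6}
c2 = {1, 2, 7, 8, 9, 10}
w = ambiguous_word(c1, c2, t=4)
print(set_distance(c1, w), set_distance(c2, w))
```

Here both distances are at most $t$, so a decoder that receives $w$ cannot tell which codeword was sent.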

Therefore by Lemma C.1, we get that the entropy loss of a secure sketch must be at least $\log\binom{n}{s}-\log A(n,2t+1,s)$ in the case of a uniform input $w$. Thus in principle, it is better to use the random permutation scheme. Nonetheless, there are caveats. First, we do not know of explicitly constructed constant-weight codes that beat the Bassalygo-Elias inequality and would thus lead to better entropy loss for the random permutation scheme than for the Hamming scheme (see [BSSS90] for more on constructions of constant-weight codes and [AVZ00] for upper bounds). Second, much more is known about efficient implementation of decoding for ordinary codes than for constant-weight codes; for example, one can find off-the-shelf hardware and software for decoding many binary codes. In practice, the Hamming-based scheme is likely to be more useful.

6.2 Improving the Construction of Juels and Sudan

We now turn to the large universe setting, where $n$ is superpolynomial in the set size $s$, and we would like operations to be polynomial in $s$ and $\log n$.

Juels and Sudan [JS06] proposed a secure sketch for the set difference metric with fixed set size (called a “fuzzy vault” in that paper). We present their original scheme here with an analysis of the entropy loss in Appendix D. In particular, our analysis shows that the original scheme has good entropy loss only when the storage space is very large.

We suggest an improved version of the Juels-Sudan scheme which is simpler and achieves much better parameters. The entropy loss and storage space of the new scheme are both $t\log n$, which is optimal. (The same parameters are also achieved by the BCH-based construction PinSketch in Section 6.3.) Our scheme has the advantage of being even simpler to analyze, and the computations are simpler. As with the original Juels-Sudan scheme, we assume $n=|\mathcal{U}|$ is a prime power and work over $\mathcal{F}=\mathit{GF}(n)$.

An intuition for the scheme is that the numbers $y_{s+1},\dots,y_r$ from the JS scheme need not be chosen at random. One can instead evaluate them as $y_i=p'(x_i)$ for some polynomial $p'$. One can then represent the entire list of pairs $(x_i,y_i)$ implicitly, using only a few of the coefficients of $p'$. The new sketch is deterministic (this was not the case for our preliminary version in [DRS04]). Its implementation is available [HJR06].

Construction 5 (Improved JS Secure Sketch for Sets of Size $s$).

To compute $\mathsf{SS}(w)$:

  1. Let $p'()$ be the unique monic polynomial of degree exactly $s$ such that $p'(x)=0$ for all $x\in w$. (That is, let $p'(z)\stackrel{\rm def}{=}\prod_{x\in w}(z-x)$.)

  2. Output the coefficients of $p'()$ of degree $s-1$ down to $s-t$. This is equivalent (up to sign) to computing and outputting the first $t$ elementary symmetric polynomials of the values in $w$; i.e., if $w=\{x_1,\dots,x_s\}$, then output

     $$\sum_i x_i,\ \sum_{i<j}x_ix_j,\ \ldots,\ \sum_{S\subseteq[s],|S|=t}\Big(\prod_{i\in S}x_i\Big).$$
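As a minimal sketch of the $\mathsf{SS}$ procedure above, the following Python code works over a prime field $\mathit{GF}(p)$, which is an assumption made for simplicity (the construction allows any prime power $n$). The names `poly_from_roots` and `sketch` are hypothetical.

```python
def poly_from_roots(roots, p):
    """Coefficients of prod_{x in roots} (z - x) mod p, lowest degree first."""
    coeffs = [1]
    for x in roots:
        new = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i + 1] = (new[i + 1] + c) % p   # contribution of c * z
            new[i] = (new[i] - c * x) % p       # contribution of c * (-x)
        coeffs = new
    return coeffs

def sketch(w, t, p):
    """Return the coefficients of p' of degree s-1 down to s-t."""
    s = len(w)
    coeffs = poly_from_roots(sorted(w), p)
    return [coeffs[s - j] for j in range(1, t + 1)]

# (z-1)(z-2)(z-3) = z^3 - 6z^2 + 11z - 6, reduced mod 7
print(sketch({1, 2, 3}, t=2, p=7))
```

The sketch consists of $t$ field elements regardless of the set size $s$, which is where the $t\log n$ storage bound comes from.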

To compute $\mathsf{Rec}(w',p')$, where $w'=\{a_1,a_2,\dots,a_s\}$,

  1. Create a new polynomial $p_{\mathrm{high}}$ of degree $s$ which shares the top $t+1$ coefficients of $p'$; that is, let $p_{\mathrm{high}}(z)\stackrel{\rm def}{=}z^s+\sum_{i=s-t}^{s-1}a_iz^i$.

  2. Evaluate $p_{\mathrm{high}}$ on all points in $w'$ to obtain $s$ pairs $(a_i,b_i)$.

  3. Use $[s,s-t,t+1]_{\mathcal{U}}$ Reed-Solomon decoding (see, e.g., [Bla83, vL92]) to search for a polynomial $p_{\mathrm{low}}$ of degree $s-t-1$ such that $p_{\mathrm{low}}(a_i)=b_i$ for at least $s-t/2$ of the $a_i$ values. If no such polynomial exists, then stop and output “fail.”

  4. Output the list of zeroes (roots) of the polynomial $p_{\mathrm{high}}-p_{\mathrm{low}}$ (see, e.g., [Sho05] for root-finding algorithms; they can be sped up by first factoring out the known roots, namely $(z-a_i)$ for the $s-t/2$ values of $a_i$ that were not deemed erroneous in the previous step).

To see that this secure sketch can tolerate $t$ set difference errors, suppose $\mathsf{dis}(w,w')\leq t$. Let $p'$ be as in the sketch algorithm; that is, $p'(z)=\prod_{x\in w}(z-x)$. The polynomial $p'$ is monic; that is, its leading term is $z^s$. We can divide the remaining coefficients into two groups: the high coefficients, denoted $a_{s-t},\dots,a_{s-1}$, and the low coefficients, denoted $b_0,\dots,b_{s-t-1}$:

$$p'(z)=\underbrace{z^s+\sum_{i=s-t}^{s-1}a_iz^i}_{p_{\mathrm{high}}(z)}+\underbrace{\sum_{i=0}^{s-t-1}b_iz^i}_{q(z)}\,.$$

We can write $p'$ as $p_{\mathrm{high}}+q$, where $q$ has degree $s-t-1$. The recovery algorithm gets the coefficients of $p_{\mathrm{high}}$ as input. For any point $x$ in $w$, we have $0=p'(x)=p_{\mathrm{high}}(x)+q(x)$. Thus, $p_{\mathrm{high}}$ and $-q$ agree at all points in $w$. Since the set $w$ intersects $w'$ in at least $s-t/2$ points, the polynomial $-q$ satisfies the conditions of Step 3 in $\mathsf{Rec}$. That polynomial is unique, since no two distinct polynomials of degree $s-t-1$ can get the correct $b_i$ on more than $s-t/2$ $a_i$'s (else, they agree on at least $s-t$ points, which is impossible). Therefore, the recovered polynomial $p_{\mathrm{low}}$ must be $-q$; hence $p_{\mathrm{high}}(x)-p_{\mathrm{low}}(x)=p'(x)$. Thus, $\mathsf{Rec}$ computes the correct $p'$ and therefore correctly finds the set $w$, which consists of the roots of $p'$.
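The recovery argument can be checked on a toy instance with the Python sketch below. It is not the efficient algorithm: it assumes a small prime field $\mathit{GF}(p)$, replaces Reed-Solomon decoding with exhaustive Lagrange interpolation over subsets of $s-t$ points, and finds roots by trying every field element. All function names are hypothetical, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
from itertools import combinations

def mul_linear(coeffs, x, p):
    """Multiply a polynomial (lowest degree first) by (z - x) mod p."""
    new = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        new[i + 1] = (new[i + 1] + c) % p
        new[i] = (new[i] - c * x) % p
    return new

def poly_eval(coeffs, x, p):
    """Evaluate a polynomial at x mod p using Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def sketch(w, t, p):
    """Coefficients of prod (z - x) of degree s-1 down to s-t."""
    coeffs = [1]
    for x in sorted(w):
        coeffs = mul_linear(coeffs, x, p)
    s = len(w)
    return [coeffs[s - j] for j in range(1, t + 1)]

def interpolate(points, p):
    """Lagrange interpolation mod p through the given (x, y) points."""
    result = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = mul_linear(basis, xj, p)
                denom = denom * (xi - xj) % p
        scale = yi * pow(denom, -1, p) % p
        for d, c in enumerate(basis):
            result[d] = (result[d] + scale * c) % p
    return result

def recover(w_prime, sk, t, p):
    """Brute-force stand-in for Rec; returns None on failure."""
    s = len(w_prime)
    p_high = [0] * (s + 1)
    p_high[s] = 1                                  # monic leading term z^s
    for j, c in enumerate(sk, start=1):
        p_high[s - j] = c
    pts = [(a, poly_eval(p_high, a, p)) for a in sorted(w_prime)]
    for subset in combinations(pts, s - t):        # s-t points fix degree <= s-t-1
        p_low = interpolate(list(subset), p)
        agree = sum(poly_eval(p_low, a, p) == b for a, b in pts)
        if agree >= s - t // 2:
            padded = p_low + [0] * (s + 1 - len(p_low))
            full = [(h - l) % p for h, l in zip(p_high, padded)]
            return {x for x in range(p) if poly_eval(full, x, p) == 0}
    return None

w, w_prime, t, p = {1, 2, 3, 4}, {1, 2, 3, 5}, 2, 11
print(recover(w_prime, sketch(w, t, p), t, p))
```

In this example $w$ and $w'$ differ in one element each way, so $\mathsf{dis}(w,w')=2=t$ and the original set is recovered.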

Since the output of $\mathsf{SS}$ is $t$ field elements, the entropy loss of the scheme is at most $t\log n$ by Lemma 3.1. (We will see below that this bound is tight, since any sketch must lose at least $t\log n$ in some situations.) We have proved:

Theorem 6.1 (Analysis of Improved JS).

Construction 5 is an average-case $(\mathsf{SDif}_s(\mathcal{U}),m,m-t\log n,t)$ secure sketch. The entropy loss and storage of the scheme are at most $t\log n$, and both the sketch generation $\mathsf{SS}()$ and the recovery procedure $\mathsf{Rec}()$ run in time polynomial in $s$, $t$ and $\log n$.

Lower Bounds for Fixed Set Size in a Large Universe.  The short length of the sketch makes this scheme feasible for essentially any ratio of set size to universe size (we only need $\log n$ to be polynomial in $s$). Moreover, for large universes the entropy loss $t\log n$ is essentially optimal for uniform inputs (i.e., when $m=\log\binom{n}{s}$). We show this as follows. As already mentioned in Section 6.1, Lemma C.1 shows that for a uniformly distributed input, the best possible entropy loss is $m-m'\geq\log\binom{n}{s}-\log A(n,2t+1,s)$.

By Theorem 12 of Agrell et al. [AVZ00], $A(n,2t+2,s)\leq\binom{n}{s-t}\big/\binom{s}{s-t}$. Noting that $A(n,2t+1,s)=A(n,2t+2,s)$ because distances in $\mathsf{SDif}_s(\mathcal{U})$ are even, the entropy loss is at least

$$m-m'\geq\log\binom{n}{s}-\log A(n,2t+1,s)\geq\log\binom{n}{s}-\log\left(\binom{n}{s-t}\Big/\binom{s}{s-t}\right)=\log\binom{n-s+t}{t}\,.$$

When $n\gg s$, this last quantity is roughly $t\log n$, as desired.
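The binomial identity behind the last equality, and the claim that the bound approaches $t\log n$ for large universes, can be checked numerically. The following Python sketch uses illustrative parameter values ($n=2^{32}$, $s=20$, $t=4$) that are assumptions chosen for this example and do not come from the paper.

```python
from math import comb, log2

def loss_lower_bound(n, s, t):
    """Return log2 C(n-s+t, t) after checking the binomial identity exactly."""
    lhs = comb(n, s) * comb(s, s - t)
    rhs = comb(n - s + t, t) * comb(n, s - t)
    assert lhs == rhs          # C(n,s) / (C(n,s-t)/C(s,s-t)) == C(n-s+t, t)
    return log2(comb(n - s + t, t))

n, s, t = 2**32, 20, 4
bound = loss_lower_bound(n, s, t)
ratio = bound / (t * log2(n))  # how close the lower bound is to t log n
print(bound, ratio)
```

The gap between the bound and $t\log n$ is about $\log t!$ bits, which is small relative to $t\log n$ when $n$ is large.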

6.3 Large Universes via the Hamming Metric: Sublinear-Time Decoding

In this section, we show that the syndrome construction of Section 5 can in fact be adapted for small sets in a large universe, using specific properties of algebraic codes. We will show that BCH codes, which contain Hamming and Reed-Solomon codes as special cases, have these properties. As opposed to the constructions of the previous section, the construction of this section is flexible and can accept input sets of any size.

Thus we obtain a sketch for sets of flexible size, with entropy loss and storage $t\log(n+1)$. We will assume that $n$ is one less than a power of 2: $n=2^m-1$ for some integer $m$, and will identify $\mathcal{U}$ with the nonzero elements of the binary finite field of degree $m$: $\mathcal{U}=\mathit{GF}(2^m)^*$.

Syndrome Manipulation for Small-Weight Words.  Suppose now that we have a small set $w\subseteq\mathcal{U}$ of size $s$, where $n\gg s$. Let $x_w$ denote the characteristic vector of $w$ (see the beginning of Section 6). Then the syndrome construction says that $\mathsf{SS}(w)=\mathsf{syn}(x_w)$. This is an $(n-k)$-bit quantity. Note that the syndrome construction gives us no special advantage over the code-offset construction when the universe is small: storing the $n$-bit $x_w+C(r)$ for a random $k$-bit $r$ is not a problem. However, it is a substantial improvement when $n\gg n-k$.

If we want to use $\mathsf{syn}(x_w)$ as the sketch of $w$, then we must choose a code with $n-k$ very small. In particular, the entropy of $w$ is at most $\log\binom{n}{s}\approx s\log n$, and so the entropy loss $n-k$ had better be at most $s\log n$. Binary BCH codes are suitable for our purposes: they are a family of $[n,k,\delta]_2$ linear codes with $\delta=2t+1$ and $k=n-tm$ (assuming $n=2^m-1$) (see, e.g., [vL92]). These codes are optimal for $t\ll n$ by the Hamming bound, which implies that $k\leq n-\log\binom{n}{t}$ [vL92]. (Footnote 10: The Hamming bound is based on the observation that for any code of distance $\delta$, the balls of radius $\lfloor(\delta-1)/2\rfloor$ centered at various codewords must be disjoint. Each such ball contains at least $\binom{n}{\lfloor(\delta-1)/2\rfloor}$ points, and so $2^k\binom{n}{\lfloor(\delta-1)/2\rfloor}\leq 2^n$. In our case $\delta=2t+1$, and so the bound yields $k\leq n-\log\binom{n}{t}$.) Using the syndrome sketch with a BCH code $C$, we get entropy loss $n-k=t\log(n+1)$, essentially the same as the $t\log n$ of the improved Juels-Sudan scheme (recall that $\delta\geq 2t+1$ allows us to correct $t$ set difference errors).
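To get a feel for how close BCH codes come to the Hamming bound, the following Python sketch compares the BCH redundancy $n-k=tm$ with the lower bound $\log\binom{n}{t}$ on redundancy. The parameter values $m=16$, $t=5$ are illustrative assumptions, and the function names are hypothetical.

```python
from math import comb, factorial, log2

def bch_redundancy(m, t):
    """n - k for a binary BCH code with n = 2^m - 1 and distance 2t + 1."""
    return t * m

def hamming_bound_redundancy(m, t):
    """Lower bound log2 C(n, t) on n - k implied by the Hamming bound."""
    n = 2**m - 1
    return log2(comb(n, t))

m, t = 16, 5
gap = bch_redundancy(m, t) - hamming_bound_redundancy(m, t)
print(gap, log2(factorial(t)))
```

For $t\ll n$ the gap is close to $\log t!$ bits, independent of $n$, which is the sense in which the redundancy $tm$ is near optimal.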

The only problem is that the scheme appears to require computation time $\Omega(n)$, since we must compute $\mathsf{syn}(x_w)=Hx_w$ and, later, run a decoding algorithm to recover $x_w$. For BCH codes, this difficulty can be overcome. A word $x_w$ of small weight can be described by listing the positions on which it is nonzero. We call this description the support of $x_w$ and write $\mathsf{supp}(x_w)$ (note that $\mathsf{supp}(x_w)=w$; see the discussion of enlarging the universe appropriately at the beginning of Section 6).

The following lemma holds for general BCH codes (which include binary BCH codes and Reed-Solomon codes as special cases). We state it for binary codes since that is most relevant to the application:

Lemma 6.2.

For a $[n,k,\delta]$ binary BCH code $C$ one can compute:

  • $\mathsf{syn}(x)$, given $\mathsf{supp}(x)$, in time polynomial in $\delta$, $\log n$, and $|\mathsf{supp}(x)|$;

  • $\mathsf{supp}(x)$, given $\mathsf{syn}(x)$ (when $x$ has weight at most $(\delta-1)/2$), in time polynomial in $\delta$ and $\log n$.

The proof of Lemma 6.2 requires a careful reworking of the standard BCH decoding algorithm. The details are presented in Appendix E. For now, we present the resulting secure sketch for set difference.

Construction 6 (PinSketch).

To compute $\mathsf{SS}(w)=\mathsf{syn}(x_w)$:

  1. Let $s_i=\sum_{x\in w}x^i$ (computations in $\mathit{GF}(2^m)$).

  2. Output $\mathsf{SS}(w)=(s_1,s_3,s_5,\dots,s_{2t-1})$.

To recover $\mathsf{Rec}(w',(s_1,s_3,\dots,s_{2t-1}))$:

  1. Compute $(s'_1,s'_3,\dots,s'_{2t-1})=\mathsf{SS}(w')=\mathsf{syn}(x_{w'})$.

  2. Let $\sigma_i=s'_i-s_i$ (in $\mathit{GF}(2^m)$, so “$-$” is the same as “$+$”).

  3. Compute $\mathsf{supp}(v)$ such that $\mathsf{syn}(v)=(\sigma_1,\sigma_3,\dots,\sigma_{2t-1})$ and $|\mathsf{supp}(v)|\leq t$ by Lemma 6.2.

  4. If $\mathsf{dis}(w,w')\leq t$, then $\mathsf{supp}(v)=w\triangle w'$. Thus, output $w=w'\triangle\mathsf{supp}(v)$.

An implementation of this construction, including the reworked BCH decoding algorithm, is available [HJR06].
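Independently of the implementation in [HJR06], the structure of PinSketch can be illustrated with a toy Python sketch over $\mathit{GF}(2^4)$, so that $n=15$. This is only a sketch under stated assumptions: the field is fixed by the irreducible polynomial $x^4+x+1$, all names are hypothetical, and Step 3 of $\mathsf{Rec}$ is done by exhaustive search over supports of size at most $t$ rather than by the sublinear-time BCH decoding of Lemma 6.2.

```python
from itertools import combinations

M = 4                  # work in GF(2^4); universe = nonzero elements 1..15
MODULUS = 0b10011      # x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply in GF(2^M): carry-less product reduced modulo MODULUS."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result

def gf_pow(a, e):
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def syndrome(w, t):
    """(s_1, s_3, ..., s_{2t-1}) with s_i = sum of x^i over x in w (sum is XOR)."""
    out = []
    for i in range(1, 2 * t, 2):
        s_i = 0
        for x in w:
            s_i ^= gf_pow(x, i)
        out.append(s_i)
    return tuple(out)

def recover(w_prime, sk, t):
    """Brute-force stand-in for Rec; returns None if no support of size <= t fits."""
    sigma = tuple(a ^ b for a, b in zip(syndrome(w_prime, t), sk))
    for size in range(t + 1):
        for v in combinations(range(1, 2**M), size):
            if syndrome(v, t) == sigma:
                return set(w_prime) ^ set(v)   # w = w' symmetric-difference supp(v)
    return None

w = {1, 5, 9, 12}
w_prime = {1, 5, 9, 7}             # symmetric difference {7, 12} has size 2
print(recover(w_prime, syndrome(w, 2), 2))
```

Note that the input sets need not have the same size: the sketch is $t$ field elements, or $tm=t\log(n+1)$ bits, whatever the size of $w$.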

The bound on entropy loss is easy to see: the output is $t\log(n+1)$ bits long, and hence the entropy loss is at most $t\log(n+1)$ by Lemma 3.1. We obtain:

Theorem 6.3.

PinSketch is an average-case $(\mathsf{SDif}(\mathcal{U}),m,m-t\log(n+1),t)$ secure sketch for set difference with storage $t\log(n+1)$. The algorithms $\mathsf{SS}$ and $\mathsf{Rec}$ both run in time polynomial in $t$ and $\log n$.

7 Constructions for Edit Distance

The space of interest in this section is the space $\mathcal{F}^*$ for some alphabet $\mathcal{F}$, with distance between two strings defined as the minimum number of character insertions and deletions needed to get from one string to the other. Denote this space by $\mathsf{Edit}_{\mathcal{F}}(n)$. Let $F=|\mathcal{F}|$.
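For concreteness, the distance just defined (insertions and deletions only, with no substitutions) can be computed by a standard dynamic program. The Python sketch below is illustrative; the function name `indel_distance` is hypothetical.

```python
def indel_distance(a, b):
    """Minimum number of single-character insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))        # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                         # delete all i characters of a's prefix
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1])   # matching characters cost nothing
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))  # delete or insert
        prev = cur
    return prev[-1]

print(indel_distance("0110", "0101"))
```

Equivalently, the distance equals $|a|+|b|$ minus twice the length of a longest common subsequence of the two strings.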

First, note that applying the generic approach for transitive metric spaces (as with the Hamming space and the set difference space for small universe sizes) does not work here, because the edit metric is not known to be transitive. Instead, we consider embeddings of the edit metric on $\{0,1\}^n$ into the Hamming or set difference metric of much larger dimension. We look at two types: standard low-distortion embeddings and “biometric” embeddings as defined in Section 4.3.

For the binary edit distance space of dimension $n$, we obtain secure sketches and fuzzy extractors correcting $t$ errors with entropy loss roughly $tn^{o(1)}$, using a standard embedding, and $2.38\sqrt[3]{tn\log n}$, using a relaxed embedding. The first technique works better when $t$ is small, say, $n^{1-\gamma}$ for a constant $\gamma>0$. The second technique is better when $t$ is large; it is meaningful roughly as long as $t<\frac{n}{15\log^2 n}$.

7.1 Low-Distortion Embeddings

A (standard) embedding with distortion $D$ is an injection $\psi:\mathcal{M}_1\hookrightarrow\mathcal{M}_2$ such that for any two points $x,y\in\mathcal{M}_1$, the ratio $\frac{\mathsf{dis}(\psi(x),\psi(y))}{\mathsf{dis}(x,y)}$ is at least 1 and at most $D$.

When the preliminary version of this paper appeared [DRS04], no nontrivial embeddings were known mapping edit distance into $\ell_1$ or the Hamming metric (i.e., known embeddings had distortion $O(n)$). Recently, Ostrovsky and Rabani [OR05] gave an embedding of the edit metric over $\mathcal{F}=\{0,1\}$ into $\ell_1$ with subpolynomial distortion. It is an injective, polynomial-time computable embedding, which can be interpreted as mapping to the Hamming space $\{0,1\}^d$, where $d=\operatorname{poly}(n)$. (Footnote 11: The embedding of [OR05] produces strings of integers in the space $\{1,\dots,O(\log n)\}^{\operatorname{poly}(n)}$, equipped with $\ell_1$ distance. One can convert this into the Hamming metric with only a logarithmic blowup in length by representing each integer in unary.)
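The unary conversion mentioned in the footnote can be illustrated with a small sketch. The helper names (`unary`, `encode_vector`, `hamming`, `l1`) and the bound `max_val` are illustrative assumptions and are not taken from [OR05]; the sketch only shows that padding each integer's unary code to a fixed width makes Hamming distance coincide with $\ell_1$ distance.

```python
def unary(value, max_val):
    """Encode an integer in {0, ..., max_val} as max_val bits: `value` ones, then zeros."""
    if not 0 <= value <= max_val:
        raise ValueError("value out of range")
    return [1] * value + [0] * (max_val - value)

def encode_vector(vec, max_val):
    """Concatenate the fixed-width unary codes of all coordinates."""
    bits = []
    for v in vec:
        bits.extend(unary(v, max_val))
    return bits

def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def l1(u, v):
    """l_1 distance between two integer vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

u, v, max_val = [3, 1, 4], [1, 1, 2], 4
# Each coordinate contributes |u_i - v_i| differing bits.
assert hamming(encode_vector(u, max_val), encode_vector(v, max_val)) == l1(u, v)
```

Each coordinate grows from one integer to `max_val` bits, which is where the logarithmic blowup comes from when the integers are bounded by $O(\log n)$.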

Fact 7.1 ([OR05]).

There is a polynomial-time computable embedding $\psi_{\rm ed}:\mathsf{Edit}_{\{0,1\}}(n)\hookrightarrow\{0,1\}^{\operatorname{poly}(n)}$ with distortion $D_{\rm ed}(n)\stackrel{\rm def}{=}2^{O(\sqrt{\log n\log\log n})}$.

We can compose this embedding with the fuzzy extractor constructions for the Hamming distance to obtain a fuzzy extractor for edit distance which will be good when $t$, the number of errors to be corrected, is quite small. Recall that instantiating the syndrome fuzzy extractor construction (Theorem 5.2) with a BCH code allows one to correct $t'$ errors out of $d$ at the cost of $t'\log d+2\log\left(\frac{1}{\epsilon}\right)-2$ bits of entropy.

Construction 7.

For any length $n$ and error threshold $t$, let $\psi_{\rm ed}$ be the embedding given by Fact 7.1 from $\mathsf{Edit}_{\{0,1\}}(n)$ into $\{0,1\}^d$ (where $d=\operatorname{poly}(n)$), and let $\mathsf{syn}$ be the syndrome of a BCH code correcting $t'=tD_{\rm ed}(n)$ errors in $\{0,1\}^d$. Let $\{H_x\}_{x\in X}$ be a family of universal hash functions from $\{0,1\}^d$ to $\{0,1\}^\ell$ for some $\ell$. To compute $\mathsf{Gen}$ on input $w\in\mathsf{Edit}_{\{0,1\}}(n)$, pick a random $x$ and output

$$R=H_x(\psi_{\rm ed}(w)),\qquad P=(\mathsf{syn}(\psi_{\rm ed}(w)),x)\,.$$

To compute $\mathsf{Rep}$ on inputs $w'$ and $P=(s,x)$, compute $y=\mathsf{Rec}(\psi_{\rm ed}(w'),s)$, where $\mathsf{Rec}$ is from Construction 3, and output $R=H_x(y)$.

Because $\psi_{\rm ed}$ is injective, a secure sketch can be constructed similarly: $\mathsf{SS}(w)=\mathsf{syn}(\psi_{\rm ed}(w))$, and to recover $w$ from $w'$ and $s$, compute $\psi_{\rm ed}^{-1}(\mathsf{Rec}(\psi_{\rm ed}(w'),s))$. However, this sketch is not known to be efficient, because it is not known how to compute $\psi_{\rm ed}^{-1}$ efficiently.

Proposition 7.2.

For any $n,t,m$, there is an average-case $(\mathsf{Edit}_{\{0,1\}}(n),m,m',t)$-secure sketch and an efficient average-case $(\mathsf{Edit}_{\{0,1\}}(n),m,\ell,t,\epsilon)$-fuzzy extractor where $m'=m-t2^{O(\sqrt{\log n\log\log n})}$ and $\ell=m'-2\log\left(\frac{1}{\epsilon}\right)+2$. In particular, for any $\alpha<1$, there exists an efficient fuzzy extractor tolerating $n^{\alpha}$ errors with entropy loss $n^{\alpha+o(1)}+2\log\left(\frac{1}{\epsilon}\right)$.

Proof.

Construction 7 is the same as the construction of Theorem 5.2 (instantiated with a BCH-code-based syndrome construction) acting on $\psi_{\rm ed}(w)$. Because $\psi_{\rm ed}$ is injective, the min-entropy of $\psi_{\rm ed}(w)$ is the same as the min-entropy $m$ of $w$. The entropy loss in Construction 3 instantiated with BCH codes is $t'\log d=t2^{O(\sqrt{\log n\log\log n})}\log\operatorname{poly}(n)$. Because $2^{O(\sqrt{\log n\log\log n})}$ grows faster than $\log n$, this is the same as $t2^{O(\sqrt{\log n\log\log n})}$. ∎
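The “in particular” clause of Proposition 7.2 is not spelled out in the proof. One way to see it, under the assumption $t=n^{\alpha}$ and using only the quantities already defined, is the following short calculation:

```latex
\begin{align*}
D_{\rm ed}(n) &= 2^{O(\sqrt{\log n\,\log\log n})}
              = n^{O\left(\sqrt{\log\log n/\log n}\right)}
              = n^{o(1)}, \\
t'\log d &= n^{\alpha}\cdot D_{\rm ed}(n)\cdot O(\log n)
          = n^{\alpha}\cdot n^{o(1)}\cdot n^{o(1)}
          = n^{\alpha+o(1)}.
\end{align*}
```

Adding the extractor's $2\log\left(\frac{1}{\epsilon}\right)-2$ term gives a total entropy loss of at most $n^{\alpha+o(1)}+2\log\left(\frac{1}{\epsilon}\right)$.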

Note that the peculiar-looking distortion function from Fact 7.1 increases more slowly than any polynomial in $n$, but still faster than any polynomial in $\log n$. In sharp contrast, the best lower bound states that any embedding of $\mathsf{Edit}_{\{0,1\}}(n)$ into $\ell_1$ (and hence Hamming) must have distortion at least $\Omega(\log n/\log\log n)$ [AK07]. Closing the gap between the two bounds remains an open problem.

General Alphabets.  To extend the above construction to general $\mathcal{F}$, we represent each character of $\mathcal{F}$ as a string of $\log F$ bits. This is an embedding of $\mathcal{F}^n$ into $\{0,1\}^{n\log F}$, which increases edit distance by a factor of at most $\log F$. Then $t'=t(\log F)D_{\rm ed}(n)$ and $d=\operatorname{poly}(n,\log F)$. Using these quantities, we get the generalization of Proposition 7.2 for larger alphabets (again, by the same embedding) by changing the formula for $m'$ to $m'=m-t(\log F)2^{O(\sqrt{\log(n\log F)\log\log(n\log F)})}$.
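The effect of the bitwise encoding on edit distance can be checked on a toy example. This is a minimal sketch, assuming edit distance counts insertions and deletions only (computed through the longest common subsequence); the names `indel_distance` and `to_bits`, and the sample alphabet and strings, are illustrative choices.

```python
def indel_distance(a, b):
    """Insertion/deletion edit distance: len(a) + len(b) - 2 * LCS(a, b)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[len(b)]

def to_bits(word, alphabet):
    """Replace each character by its index in `alphabet`, written with a fixed bit width."""
    width = max(1, (len(alphabet) - 1).bit_length())
    return "".join(format(alphabet.index(ch), "0{}b".format(width)) for ch in word)

alphabet = "abcdefgh"                   # F = 8, so each character becomes 3 bits
w, w2 = "abcdecdeah", "abdecdeahh"      # w2: delete one 'c', append one 'h'
d_chars = indel_distance(w, w2)
d_bits = indel_distance(to_bits(w, alphabet), to_bits(w2, alphabet))
# Each character insertion or deletion costs at most log F bit-level edits.
assert d_bits <= 3 * d_chars
```

The bit-level distance can be smaller than $\log F$ times the character-level distance, since bit-level edits are not forced to respect character boundaries; only the upper bound is used in the text.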

7.2 Relaxed Embeddings for the Edit Metric

In this section, we show that a relaxed notion of embedding, called a biometric embedding in Section 4.3, can produce fuzzy extractors and secure sketches that are better than what one can get from the embedding of [OR05] when $t$ is large (they are also much simpler algorithmically, which makes them more practical). We first discuss fuzzy extractors and later extend the technique to secure sketches.

Fuzzy Extractors.  Recall that unlike low-distortion embeddings, biometric embeddings do not care about relative distances, as long as points that were “close” (closer than $t_1$) do not become “distant” (farther apart than $t_2$). The only additional requirement of a biometric embedding is that it preserve some min-entropy: we do not want too many points to collide together. We now describe such an embedding from the edit distance to the set difference.

A $c$-shingle is a length-$c$ consecutive substring of a given string $w$. A $c$-shingling [Bro97] of a string $w$ of length $n$ is the set (ignoring order or repetition) of all $(n-c+1)$ $c$-shingles of $w$. (For instance, a 3-shingling of “abcdecdeah” is {abc, bcd, cde, dec, ecd, dea, eah}.) Thus, the range of the $c$-shingling operation consists of all nonempty subsets of size at most $n-c+1$ of $\mathcal{F}^c$. Let $\mathsf{SDif}(\mathcal{F}^c)$ stand for the set difference metric over subsets of $\mathcal{F}^c$ and $\mathsf{SH}_c$ stand for the $c$-shingling map from $\mathsf{Edit}_{\mathcal{F}}(n)$ to $\mathsf{SDif}(\mathcal{F}^c)$. We now show that $\mathsf{SH}_c$ is a good biometric embedding.
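The shingling map is short enough to write out directly. The following is a minimal sketch; the function name `shingle_set` is an illustrative choice, and strings are modeled as Python `str` values.

```python
def shingle_set(w, c):
    """Return SH_c(w): the set of all length-c consecutive substrings of w."""
    if not 1 <= c <= len(w):
        raise ValueError("need 1 <= c <= len(w)")
    # There are len(w) - c + 1 windows; the set drops order and repetitions.
    return {w[i:i + c] for i in range(len(w) - c + 1)}

sh = shingle_set("abcdecdeah", 3)
# The eight windows contain "cde" twice, so the set has seven elements.
assert sh == {"abc", "bcd", "cde", "dec", "ecd", "dea", "eah"}
```

Because repetitions are discarded, two different strings may map to the same shingle set, which is why the embedding is only required to preserve some min-entropy rather than to be injective.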

Lemma 7.3.

For any $c$, $\mathsf{SH}_c$ is an average-case $(t_1,\ t_2=(2c-1)t_1,\ m_1,\ m_2=m_1-\lceil\frac{n}{c}\rceil\log_2(n-c+1))$-biometric embedding of $\mathsf{Edit}_{\mathcal{F}}(n)$ into $\mathsf{SDif}(\mathcal{F}^c)$.

Proof.

Let $w,w'\in\mathsf{Edit}_{\mathcal{F}}(n)$ be such that $\mathsf{dis}(w,w')\leq t_1$ and $I$ be the sequence of at most $t_1$ insertions and deletions that transforms $w$ into $w'$. It is easy to see that each character deletion or insertion adds at most $(2c-1)$ to the symmetric difference between $\mathsf{SH}_c(w)$ and $\mathsf{SH}_c(w')$, which implies that $\mathsf{dis}(\mathsf{SH}_c(w),\mathsf{SH}_c(w'))\leq(2c-1)t_1$, as needed.

For $w\in\mathcal{F}^n$, define $g_c(w)$ as follows. Compute $\mathsf{SH}_c(w)$ and store the resulting shingles in lexicographic order $h_1\ldots h_k$ ($k\leq n-c+1$). Next, naturally partition $w$ into $\lceil n/c\rceil$ $c$-shingles $s_1\ldots s_{\lceil n/c\rceil}$, all disjoint except for (possibly) the last two, which overlap by $c\lceil n/c\rceil-n$ characters. Next, for $1\leq j\leq\lceil n/c\rceil$, set $p_j$ to be the index $i\in\{1,\ldots,k\}$ such that $s_j=h_i$. In other words, $p_j$ tells the index of the $j$th disjoint shingle of $w$ in the alphabetically ordered $k$-set $\mathsf{SH}_c(w)$.
Set $g_c(w)=(p_1,\dots,p_{\lceil n/c\rceil})$. (For instance, $g_3(\text{“abcdecdeah”})=(1,5,4,6)$, representing the alphabetical order of “abc”, “dec”, “dea” and “eah” in $\mathsf{SH}_3(\text{“abcdecdeah”})$.) The number of possible values for $g_c(w)$ is at most $(n-c+1)^{\lceil\frac{n}{c}\rceil}$, and $w$ can be completely recovered from $\mathsf{SH}_c(w)$ and $g_c(w)$.

Now, assume $W$ is any distribution of min-entropy at least $m_1$ on $\mathsf{Edit}_{\mathcal{F}}(n)$. Applying Lemma 2.2(b), we get $\tilde{\mathbf{H}}_\infty(W\mid g_c(W))\geq m_1-\lceil\frac{n}{c}\rceil\log_2(n-c+1)$. Since $\Pr(W=w\mid g_c(W)=g)=\Pr(\mathsf{SH}_c(W)=\mathsf{SH}_c(w)\mid g_c(W)=g)$ (because given $g_c(w)$, $\mathsf{SH}_c(w)$ uniquely determines $w$ and vice versa), by applying the definition of $\tilde{\mathbf{H}}_\infty$, we obtain $\mathbf{H}_\infty(\mathsf{SH}_c(W))\geq\tilde{\mathbf{H}}_\infty(\mathsf{SH}_c(W)\mid g_c(W))=\tilde{\mathbf{H}}_\infty(W\mid g_c(W))$. The same proof holds for average min-entropy, conditioned on some auxiliary information $I$. ∎
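Both halves of the proof of Lemma 7.3 can be exercised on small inputs. The sketch below is illustrative only: the names `block_starts`, `g`, `recover` and `random_indel` are assumptions of this example, indices are 1-based as in the worked example for $g_3$, and Python's default string ordering stands in for lexicographic order. It computes the recovery information $g_c(w)$, rebuilds $w$ from $\mathsf{SH}_c(w)$ and $g_c(w)$, and checks the $(2c-1)$ bound for single random insertions and deletions.

```python
import random

def shingle_set(w, c):
    """SH_c(w): the set of all length-c consecutive substrings of w."""
    return {w[i:i + c] for i in range(len(w) - c + 1)}

def block_starts(n, c):
    """Start offsets of the ceil(n/c) blocks; the last block is shifted left to end at n."""
    starts = list(range(0, n - c + 1, c))
    if starts[-1] + c < n:
        starts.append(n - c)
    return starts

def g(w, c):
    """g_c(w): 1-based rank of each block within the sorted shingle set."""
    rank = {s: i for i, s in enumerate(sorted(shingle_set(w, c)), start=1)}
    return [rank[w[p:p + c]] for p in block_starts(len(w), c)]

def recover(shingles, ranks, n, c):
    """Rebuild w from SH_c(w) and g_c(w) by writing each block back at its offset."""
    ordered = sorted(shingles)
    out = [""] * n
    for p, r in zip(block_starts(n, c), ranks):
        out[p:p + c] = ordered[r - 1]
    return "".join(out)

def random_indel(w, alphabet, rng):
    """Apply one random character deletion or insertion to w."""
    i = rng.randrange(len(w) + 1)
    if i < len(w) and rng.random() < 0.5:
        return w[:i] + w[i + 1:]
    return w[:i] + rng.choice(alphabet) + w[i:]

w, c = "abcdecdeah", 3
assert g(w, c) == [1, 5, 4, 6]
assert recover(shingle_set(w, c), g(w, c), len(w), c) == w

rng = random.Random(0)
for _ in range(1000):
    u = "".join(rng.choice("ab") for _ in range(30))
    u2 = random_indel(u, "ab", rng)
    # One insertion or deletion changes the shingle set by at most 2c - 1 elements.
    assert len(shingle_set(u, c) ^ shingle_set(u2, c)) <= 2 * c - 1
```

The random test is only a sanity check of the bound on a binary alphabet, not a proof; the counting argument in the lemma is what establishes it in general.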

By Theorem 6.3, for universe $\mathcal{F}^c$ of size $F^c$ and distance threshold $t_2=(2c-1)t_1$, we can construct a secure sketch for the set difference metric with entropy loss $t_2\lceil\log(F^c+1)\rceil$ ($\lceil\cdot\rceil$ because Theorem 6.3 requires the universe size to be one less than a power of 2). By Lemma 4.3, we can obtain a fuzzy extractor from such a sketch, with additional entropy loss $2\log\left(\frac{1}{\epsilon}\right)-2$. Applying Lemma 4.6 to the above embedding and this fuzzy extractor, we obtain a fuzzy extractor for $\mathsf{Edit}_{\mathcal{F}}(n)$, any input entropy $m$, any distance $t$, and any security parameter $\epsilon$, with the following entropy loss:

$$\left\lceil\frac{n}{c}\right\rceil\cdot\log_2(n-c+1)+(2c-1)t\lceil\log(F^c+1)\rceil+2\log\left(\frac{1}{\epsilon}\right)-2$$

(the first component of the entropy loss comes from the embedding, the second from the secure sketch for set difference, and the third from the extractor). The above sequence of lemmas results in the following construction, parameterized by shingle length $c$ and a family of universal hash functions $\mathcal{H}=\{H_x:\mathsf{SDif}(\mathcal{F}^c)\to\{0,1\}^l\}_{x\in X}$, where $l$ is equal to the input entropy $m$ minus the entropy loss above.

Construction 8 (Fuzzy Extractor for Edit Distance).

To compute $\mathsf{Gen}(w)$ for $|w|=n$:

  1. Compute $\mathsf{SH}_c(w)$ by computing $n-c+1$ shingles $(v_1,v_2,\dots,v_{n-c+1})$ and removing duplicates to form the shingle set $v$ from $w$.

  2. Compute $s=\mathsf{syn}(x_v)$ as in Construction 6.

  3. Select a hash function $H_x\in\mathcal{H}$ and output $(R=H_x(v),P=(s,x))$.

To compute $\mathsf{Rep}(w',(s,x))$:

  1. Compute $\mathsf{SH}_c(w')$ as above to get $v'$.

  2. Use $\mathsf{Rec}(v',s)$ from Construction 6 to recover $v$.

  3. Output $R=H_x(v)$.

We thus obtain the following theorem.

Theorem 7.4.

For any $n,m,c$ and $0<\epsilon\leq 1$, there is an efficient average-case $(\mathsf{Edit}_{\mathcal{F}}(n),\ m,\ m-\lceil\frac{n}{c}\rceil\log_2(n-c+1)-(2c-1)t\lceil\log(F^c+1)\rceil-2\log\left(\frac{1}{\epsilon}\right)+2,\ t,\ \epsilon)$-fuzzy extractor.

Note that the choice of $c$ is a parameter; by ignoring $\lceil\cdot\rceil$ and replacing $n-c+1$ with $n$, $2c-1$ with $2c$ and $F^c+1$ with $F^c$, we get that the minimum entropy loss occurs near

$$c=\left(\frac{n\log n}{4t\log F}\right)^{1/3}$$

and is about $2.38(t\log F)^{1/3}(n\log n)^{2/3}$ ($2.38$ is really $\sqrt[3]{4}+1/\sqrt[3]{2}$). In particular, if the original string has a linear amount of entropy $\theta(n\log F)$, then we can tolerate $t=\Omega(n\log^2 F/\log^2 n)$ insertions and deletions while extracting $\theta(n\log F)-2\log\left(\frac{1}{\epsilon}\right)$ bits. The number of bits extracted is linear; if the string length $n$ is polynomial in the alphabet size $F$, then the number of errors tolerated is linear also.
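The approximate optimum for $c$ can be compared with the exact entropy-loss formula of Theorem 7.4 numerically. This is a minimal sketch, assuming base-2 logarithms throughout; the function names and the sample parameters ($n=1000$, $t=10$, $F=2$, $\epsilon=2^{-20}$) are illustrative choices, not values from the paper.

```python
from math import ceil, log2

def entropy_loss(n, t, F, c, eps):
    """Exact entropy loss of the shingling-based fuzzy extractor for shingle length c."""
    embedding = ceil(n / c) * log2(n - c + 1)
    # ceil(log2(F**c + 1)) computed exactly on integers via bit_length.
    sketch = (2 * c - 1) * t * (F ** c).bit_length()
    extractor = 2 * log2(1 / eps) - 2
    return embedding + sketch + extractor

def best_c(n, t, F, eps):
    """Integer shingle length minimizing the exact loss, found by exhaustive search."""
    return min(range(1, n + 1), key=lambda c: entropy_loss(n, t, F, c, eps))

def approx_c(n, t, F):
    """Closed-form approximation (n log n / (4 t log F))^(1/3)."""
    return (n * log2(n) / (4 * t * log2(F))) ** (1 / 3)

def approx_loss(n, t, F):
    """Approximate minimum loss 2.38 (t log F)^(1/3) (n log n)^(2/3), without the extractor term."""
    const = 4 ** (1 / 3) + 2 ** (-1 / 3)
    return const * (t * log2(F)) ** (1 / 3) * (n * log2(n)) ** (2 / 3)

n, t, F, eps = 1000, 10, 2, 2 ** -20
print(best_c(n, t, F, eps), approx_c(n, t, F))
print(entropy_loss(n, t, F, best_c(n, t, F, eps), eps), approx_loss(n, t, F))
```

The exact and approximate figures differ somewhat because the closed form ignores the ceilings and the $-1$ and $+1$ corrections, and because `approx_loss` omits the $2\log\left(\frac{1}{\epsilon}\right)-2$ extractor term.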

Secure Sketches.  Observe that the proof of Lemma 7.3 actually demonstrates that our biometric embedding based on shingling is an embedding with recovery information $g_c$. Observe also that it is easy to reconstruct $w$ from $\mathsf{SH}_c(w)$ and $g_c(w)$. Finally, note that PinSketch (Construction 6) is an average-case secure sketch (as are all secure sketches in this work). Thus, combining Theorem 6.3 with Lemma 4.7, we obtain the following theorem.

Construction 9 (Secure Sketch for Edit Distance).

For $\mathsf{SS}(w)$, compute $v=\mathsf{SH}_c(w)$ and $s_1=\mathsf{syn}(x_v)$ as in Construction 8. Compute $s_2=g_c(w)$, writing each $p_j$ as a string of $\lceil\log n\rceil$ bits. Output $s=(s_1,s_2)$. For $\mathsf{Rec}(w',(s_1,s_2))$, recover $v$ as in Construction 8, sort it in alphabetical order, and recover $w$ by stringing along elements of $v$ according to indices in $s_2$.

Theorem 7.5.

For any $n, m, c$ and $0 < \epsilon \le 1$, there is an efficient average-case $(\mathsf{Edit}_{\mathcal{F}}(n),\ m,\ m - \lceil \frac{n}{c} \rceil \log_2(n-c+1) - (2c-1)\,t\,\lceil \log(F^c+1) \rceil,\ t)$ secure sketch.

The discussion about optimal values of $c$ from above applies equally here.

Remark 1.

In our definitions of secure sketches and fuzzy extractors, we required the original $w$ and the (potentially) modified $w'$ to come from the same space $\mathcal{M}$. This requirement was for simplicity of exposition. We can allow $w'$ to come from a larger set, as long as distance from $w$ is well-defined. In the case of edit distance, for instance, $w'$ can be shorter or longer than $w$; all the above results will apply as long as it is still within $t$ insertions and deletions.

8 Probabilistic Notions of Correctness

The error model considered so far in this work is very strong: we required that secure sketches and fuzzy extractors accept every secret $w'$ within distance $t$ of the original input $w$, with no probability of error.

Such a stringent model is useful as it makes no assumptions on either the exact stochastic properties of the error process or the adversary’s computational limits. However, Lemma C.1 shows that secure sketches (and fuzzy extractors) correcting $t$ errors can only be as “good” as error-correcting codes with minimum distance $2t+1$. By slightly relaxing the correctness condition, we will see that one can tolerate many more errors. For example, there is no good code which can correct $n/4$ errors in the binary Hamming metric: by the Plotkin bound (see, e.g., [Sud01, Lecture 8]) a code with minimum distance greater than $n/2$ has at most $2n$ codewords. Thus, there is no secure sketch with residual entropy $m' \ge \log n$ which can correct $n/4$ errors with probability 1. However, with the relaxed notions of correctness below, one can tolerate arbitrarily close to $n/2$ errors, i.e., correct $n(\frac{1}{2} - \gamma)$ errors for any constant $\gamma > 0$, and still have residual entropy $\Omega(n)$.

In this section, we discuss three relaxed error models and show how the constructions of the previous sections can be modified to gain greater error-correction in these models. We will focus on secure sketches for the binary Hamming metric. The same constructions yield fuzzy extractors (by Lemma 4.1). Many of the observations here also apply to metrics other than Hamming.

A common point is that we will require only that a corrupted input $w'$ be recovered with probability at least $1 - \alpha < 1$ (the probability space varies). We describe each model in terms of the additional assumptions made on the error process. We describe constructions for each model in the subsequent sections.

Random Errors.

Assume there is a known distribution on the errors which occur in the data. For the Hamming metric, the most common distribution is the binary symmetric channel $\mathrm{BSC}_p$: each bit of the input is flipped with probability $p$ and left untouched with probability $1-p$. We require that for any input $w$, $\mathsf{Rec}(W', \mathsf{SS}(w)) = w$ with probability at least $1 - \alpha$ over the coins of $\mathsf{SS}$ and over $W'$ drawn by applying the noise distribution to $w$.
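As a concrete picture of this noise model, the following minimal sketch samples $W'$ from $w$ under a binary symmetric channel. The function name `bsc` and the use of Python's `random.Random` as the noise source are illustrative assumptions, not part of the paper.

```python
import random

def bsc(w, p, rng):
    """Binary symmetric channel: flip each bit of w independently with probability p."""
    # (rng.random() < p) is True with probability p; XOR with True flips the bit.
    return [bit ^ (rng.random() < p) for bit in w]
```

With `p = 0` the input is returned unchanged, and for large `n` the number of flipped bits concentrates around `p * n`, which is why the random-error model is often summarized by an error rate rather than an error count.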

In that case, one can correct an error rate up to Shannon’s bound on noisy channel coding. This bound is tight. Unfortunately, the assumption of a known noise process is too strong for most applications: there is no reason to believe we understand the exact distribution on errors which occur in complex data such as biometrics. (Footnote 12: Since the assumption here plays a role only in correctness, it is still more reasonable than assuming that we know exact distributions on the data in proofs of secrecy. However, in both cases, we would like to enlarge the class of distributions for which we can provably satisfy the definition of security.) Nonetheless, the random-error model provides a useful baseline by which to measure results for other models.

Input-dependent Errors.

The errors are adversarial, subject only to the conditions that (a) the error magnitude $\mathsf{dis}(w, w')$ is bounded to a maximum of $t$, and (b) the corrupted word depends only on the input $w$, and not on the secure sketch $\mathsf{SS}(w)$. Here we require that for any pair $w, w'$ at distance at most $t$, we have $\mathsf{Rec}(w', \mathsf{SS}(w)) = w$ with probability at least $1 - \alpha$ over the coins of $\mathsf{SS}$.

This model encompasses any complex noise process which has been observed to never introduce more than $t$ errors. Unlike the assumption of a particular distribution on the noise, the bound on magnitude can be checked experimentally. Perhaps surprisingly, in this model we can tolerate just as large an error rate as in the model of random errors. That is, we can tolerate an error rate up to Shannon’s coding bound and no more.

Computationally bounded Errors.

The errors are adversarial and may depend on both $w$ and the publicly stored information $\mathsf{SS}(w)$. However, we assume that the errors are introduced by a process of bounded computational power. That is, there is a probabilistic circuit of polynomial size (in the length $n$) which computes $w'$ from $w$. The adversary cannot, for example, forge a digital signature and base the error pattern on the signature.

It is not clear whether this model allows correcting errors up to the Shannon bound, as in the two models above. The question is related to open questions on the construction of efficiently list-decodable codes. However, when the error rate is either very high or very low, then the appropriate list-decodable codes exist and we can indeed match the Shannon bound.

Analogues for Noisy Channels and the Hamming Metric.  Models analogous to the ones above have been studied in the literature on codes for noisy binary channels (with the Hamming metric). Random errors and computationally bounded errors both make obvious sense in the coding context [Sha48, MPSW05]. The second model — input-dependent errors — does not immediately make sense in a coding situation, since there is no data other than the transmitted codeword on which errors could depend. Nonetheless, there is a natural, analogous model for noisy channels: one can allow the sender and receiver to share either (1) common, secret random coins (see [DGL04, Lan04] and references therein) or (2) a side channel with which they can communicate a small number of noise-free, secret bits [Gur03].

Existing results on these three models for the Hamming metric can be transported to our context using the code-offset construction:

$$\mathsf{SS}(w; x) = w \oplus C(x)\,.$$

Roughly, any code which corrects errors in the models above will lead to a secure sketch (resp. fuzzy extractor) which corrects errors in the model. We explore the consequences for each of the three models in the next sections.
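To make the code-offset construction concrete, here is a minimal sketch that instantiates $C$ with a 3-fold repetition code. The choice of code, the parameter `REP`, and the function names are assumptions made for illustration; any error-correcting code with an encoder and decoder could be substituted.

```python
import secrets

REP = 3  # repetition factor (illustrative choice of code C)

def encode(x):
    """C(x): repeat every message bit REP times."""
    return [b for b in x for _ in range(REP)]

def decode(y):
    """Majority vote in each block of REP bits; returns the message x."""
    return [int(2 * sum(y[i:i + REP]) > REP) for i in range(0, len(y), REP)]

def xor(a, b):
    return [u ^ v for u, v in zip(a, b)]

def sketch(w):
    """SS(w; x) = w XOR C(x) for a fresh random message x."""
    x = [secrets.randbelow(2) for _ in range(len(w) // REP)]
    return xor(w, encode(x))

def recover(w_prime, s):
    """Rec(w', s): decode w' XOR s = C(x) XOR e, then undo the offset."""
    x = decode(xor(w_prime, s))
    return xor(s, encode(x))
```

Recovery succeeds here whenever each block of `REP` positions contains at most one flipped bit, because majority voting then returns the original message $x$ and hence $w = s \oplus C(x)$.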

8.1 Random Errors

The random error model was famously considered by Shannon [Sha48]. He showed that for any discrete, memoryless channel, the rate at which information can be reliably transmitted is characterized by the maximum mutual information between the inputs and outputs of the channel. For the binary symmetric channel with crossover probability $p$, this means that there exist codes encoding $k$ bits into $n$ bits, tolerating error probability $p$ in each bit if and only if

$$\frac{k}{n} < 1 - h(p) - \delta(n)\,,$$

where $h(p) = -p \log p - (1-p)\log(1-p)$ and $\delta(n) = o(1)$. Computationally efficient codes achieving this bound were found later, most notably by Forney [For66]. We can use the code-offset construction $\mathsf{SS}(w; x) = w \oplus C(x)$ with an appropriate concatenated code [For66] or, equivalently, $\mathsf{SS}(w) = \mathsf{syn}_C(w)$ since the codes can be linear. We obtain:

Proposition 8.1.

For any error rate $0 < p < 1/2$ and constant $\delta > 0$, for large enough $n$ there exist secure sketches with entropy loss $(h(p) + \delta)n$, which correct the error rate of $p$ in the data with high probability (the failure probability is roughly $2^{-c_\delta n}$ for a constant $c_\delta > 0$).

The probability here is taken over the errors only (the distribution on input strings $w$ can be arbitrary).

The quantity $h(p)$ is less than 1 for any $p$ in the range $(0, 1/2)$. In particular, one can get nontrivial secure sketches even for a very high error rate $p$ as long as it is less than $1/2$; in contrast, no secure sketch which corrects errors with probability 1 can tolerate $t \ge n/4$. Note that several other works on biometric cryptosystems consider the model of randomized errors and obtain similar results, though the analyses assume that the distribution on inputs is uniform [TG04, CZ04].
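The binary entropy function and the resulting entropy-loss bound of Proposition 8.1 are easy to evaluate numerically. The short sketch below does so; the function names are illustrative, and logarithms are taken base 2 so that the result is measured in bits.

```python
import math

def binary_entropy(p: float) -> float:
    """h(p) = -p log2 p - (1 - p) log2 (1 - p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_loss(n: int, p: float, delta: float) -> float:
    """Entropy loss (h(p) + delta) * n from Proposition 8.1."""
    return (binary_entropy(p) + delta) * n
```

For example, `binary_entropy(0.25)` is about 0.811, so at a 25% random error rate roughly 19% of the input length can remain as residual entropy (before subtracting the $\delta n$ slack), whereas no sketch with perfect correctness tolerates $t \ge n/4$ at all.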

A Matching Impossibility Result.  The bound above is tight. The matching impossibility result also applies to input-dependent and computationally bounded errors, since random errors are a special case of both more complex models.

We start with an intuitive argument: If a secure sketch allows recovering from random errors with high probability, then it must contain enough information about $w$ to describe the error pattern (since given $w'$ and $\mathsf{SS}(w)$, one can recover the error pattern with high probability). Describing the outcome of $n$ independent coin flips with probability $p$ of heads requires $nh(p)$ bits, and so the sketch must reveal $nh(p)$ bits about $w$.

In fact, that argument simply shows that $nh(p)$ bits of Shannon information are leaked about $w$, whereas we are concerned with min-entropy loss as defined in Section 3. To make the argument more formal, let $W$ be uniform over $\{0,1\}^n$ and observe that with high probability over the output of the sketching algorithm, $v = \mathsf{SS}(w)$, the conditional distribution $W_v = W|_{\mathsf{SS}(W)=v}$ forms a good code for the binary symmetric channel. That is, for most values $v$, if we sample a random string $w$ from $W|_{\mathsf{SS}(W)=v}$ and send it through a binary symmetric channel, we will be able to recover the correct value $w$. That means there exists some $v$ such that both (a) $W_v$ is a good code and (b) $\mathbf{H}_\infty(W_v)$ is close to $\tilde{\mathbf{H}}_\infty(W \mid \mathsf{SS}(W))$. Shannon’s noisy coding theorem says that such a code can have entropy at most $n(1 - h(p) + o(1))$. Thus the construction above is optimal:

Proposition 8.2.

For any error rate $0 < p < 1/2$, any secure sketch $\mathsf{SS}$ which corrects random errors (with rate $p$) with probability at least $2/3$ has entropy loss at least $n(h(p) - o(1))$; that is, $\tilde{\mathbf{H}}_\infty(W \mid \mathsf{SS}(W)) \le n(1 - h(p) - o(1))$ when $W$ is drawn uniformly from $\{0,1\}^n$.

8.2 Randomizing Input-dependent Errors

Assuming errors distributed randomly according to a known distribution seems very limiting. In the Hamming metric, one can construct a secure sketch which achieves the same result as with random errors for every error process where the magnitude of the error is bounded, as long as the errors are independent of the output of $\mathsf{SS}(W)$. The same technique was used previously by Bennett et al. [BBR88, p. 216] and, in a slightly different context, Lipton [Lip94, DGL04].

The idea is to choose a random permutation $\pi : [n] \to [n]$, permute the bits of $w$ before applying the sketch, and store the permutation $\pi$ along with $\mathsf{SS}(\pi(w))$. Specifically, let $C$ be a linear code tolerating a $p$ fraction of random errors with redundancy $n - k \approx nh(p)$. Let

$$\mathsf{SS}(w; \pi) = (\pi,\ \mathsf{syn}_C(\pi(w)))\,,$$

where $\pi : [n] \to [n]$ is a random permutation and, for $w = w_1 \cdots w_n \in \{0,1\}^n$, $\pi(w)$ denotes the permuted string $w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)}$. The recovery algorithm operates in the obvious way: it first permutes the input $w'$ according to $\pi$ and then runs the usual syndrome recovery algorithm to recover $\pi(w)$.

For any particular pair $w, w'$, the difference $w \oplus w'$ will be mapped to a random vector of the same weight by $\pi$, and any code for the binary symmetric channel (with rate $p \approx t/n$) will correct such an error with high probability.

Thus we can construct a sketch with entropy loss $n(h(t/n) - o(1))$ which corrects any $t$ flipped bits with high probability. This is optimal by the lower bound for random errors (Proposition 8.2), since a sketch for data-dependent errors will also correct random errors. It is also possible to reduce the amount of randomness, so that the size of the sketch meets the same optimal bound [Smi07].
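The data flow of the permuted syndrome sketch can be illustrated with a toy example. The sketch below uses the $[7,4]$ Hamming code as a stand-in for $C$; this is an assumption made purely to keep the example short. Since that code already corrects any single error deterministically, the permutation gives no benefit here; a real instantiation would use a long code designed for the binary symmetric channel. Function names and 0-based indexing are likewise illustrative.

```python
import random

N = 7  # block length of the [7,4] Hamming code (toy stand-in for C)

def syn(y):
    """Syndrome for the parity-check matrix whose i-th column (1-based) is i in binary."""
    s = 0
    for i, bit in enumerate(y, start=1):
        if bit:
            s ^= i
    return s

def permute(pi, w):
    """pi(w): the string w_{pi(0)} w_{pi(1)} ... w_{pi(n-1)} (0-based indices)."""
    return [w[pi[i]] for i in range(len(w))]

def sketch(w, rng):
    """SS(w; pi) = (pi, syn(pi(w))) for a random permutation pi."""
    pi = list(range(N))
    rng.shuffle(pi)
    return pi, syn(permute(pi, w))

def recover(w_prime, s):
    pi, syndrome = s
    y = permute(pi, w_prime)        # pi(w') = pi(w) XOR pi(error)
    e = syn(y) ^ syndrome           # syndrome of the permuted error pattern
    if e:                           # single error at 1-based position e
        y[e - 1] ^= 1
    w = [0] * N                     # y now equals pi(w); undo the permutation
    for i in range(N):
        w[pi[i]] = y[i]
    return w
```

The last loop makes explicit a step the text leaves implicit: once $\pi(w)$ has been recovered, $w$ itself is obtained by applying $\pi^{-1}$.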

An alternative approach to input-dependent errors is discussed in the last paragraph of Section 8.3.

8.3 Handling Computationally Bounded Errors Via List Decoding

As mentioned above, many results on noisy coding for other error models in Hamming space extend to secure sketches. The previous sections discussed random, and randomized, errors. In this section, we discuss constructions [Gur03, Lan04, MPSW05] which transform a list-decodable code, defined below, into uniquely decodable codes for a particular error model. These transformations can also be used in the setting of secure sketches, leading to better tolerance of computationally bounded errors. For some ranges of parameters, this yields optimal sketches, that is, sketches which meet the Shannon bound on the fraction of tolerated errors.

List-Decodable Codes.  A code $C$ in a metric space $\mathcal{M}$ is called list-decodable with list size $L$ and distance $t$ if for every point $x \in \mathcal{M}$, there are at most $L$ codewords within distance $t$ of $x$. A list-decoding algorithm takes as input a word $x$ and returns the corresponding list $c_1, c_2, \dots$ of codewords. The most interesting setting is when $L$ is a small polynomial (in the description size $\log |\mathcal{M}|$), and there exists an efficient list-decoding algorithm. It is then feasible for an algorithm to go over each word in the list and accept if it has some desirable property. There are many examples of such codes for the Hamming space; for a survey see Guruswami’s thesis [Gur01]. Recently there has been significant progress in constructing list-decodable codes for large alphabets, e.g., [PV05, GR06].

Similarly, we can define a list-decodable secure sketch with size $L$ and distance $t$ as follows: for any pair of words $w, w' \in \mathcal{M}$ at distance at most $t$, the algorithm $\mathsf{Rec}(w', \mathsf{SS}(w))$ returns a list of at most $L$ points in $\mathcal{M}$; if $\mathsf{dis}(w, w') \le t$, then one of the words in the list must be $w$ itself. The simplest way to obtain a list-decodable secure sketch is to use the code-offset construction of Section 5 with a list-decodable code for the Hamming space. One obtains a different example by running the improved Juels-Sudan scheme for set difference (Construction 5), replacing ordinary decoding of Reed-Solomon codes with list decoding. This yields a significant improvement in the number of errors tolerated at the price of returning a list of possible candidates for the original secret.

Sieving the List.  Given a list-decodable secure sketch $\mathsf{SS}$, all that’s needed is to store some additional information which allows the receiver to disambiguate $w$ from the list. Let’s suggestively name the additional information $\mathit{Tag}(w; R)$, where $R$ is some additional randomness (perhaps a key). Given a list-decodable code $C$, the sketch will typically look like

$$\mathsf{SS}(w; x) = (\ w \oplus C(x),\ \mathit{Tag}(w)\ )\,.$$

On inputs $w'$ and $(\Delta, \mathit{tag})$, the recovery algorithm consists of running the list-decoding algorithm on $w' \oplus \Delta$ to obtain a list of possible codewords $C(x_1), \dots, C(x_L)$. There is a corresponding list of candidate inputs $w_1, \dots, w_L$, where $w_i = C(x_i) \oplus \Delta$, and the algorithm outputs the first $w_i$ in the list such that $\mathit{Tag}(w_i) = \mathit{tag}$. We will choose the function $\mathit{Tag}()$ so that the adversary cannot arrange to have two values in the list with valid tags.
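The recovery procedure just described can be sketched in a few lines. Everything concrete below is an assumption chosen for illustration: the code $C$ is a tiny repetition code, list decoding is done by brute force over all messages, and $\mathit{Tag}$ is a placeholder built from SHA-256 with a random prefix. None of these choices is prescribed by the text.

```python
import hashlib
import secrets
from itertools import product

K, REP, T = 2, 3, 2   # toy parameters: 2-bit messages, 3x repetition, list radius 2

def encode(x):
    return tuple(b for b in x for _ in range(REP))

def xor(a, b):
    return tuple(u ^ v for u, v in zip(a, b))

def list_decode(y):
    """Brute force: every message whose codeword is within distance T of y."""
    return [x for x in product((0, 1), repeat=K)
            if sum(u != v for u, v in zip(encode(x), y)) <= T]

def make_tag(key):
    """Placeholder Tag function: SHA-256 of key || w (an illustrative assumption)."""
    return lambda w: hashlib.sha256(key + bytes(w)).digest()

def sketch(w):
    x = tuple(secrets.randbelow(2) for _ in range(K))
    key = secrets.token_bytes(16)
    return xor(w, encode(x)), key, make_tag(key)(w)

def recover(w_prime, s):
    delta, key, tag = s
    tag_fn = make_tag(key)
    for x in list_decode(xor(w_prime, delta)):   # candidate codewords C(x_i)
        w_i = xor(encode(x), delta)              # candidate inputs w_i
        if tag_fn(w_i) == tag:                   # first candidate with a valid tag
            return w_i
    return None
```

With these parameters the code has minimum distance 3, so two errors in the same block defeat unique decoding: the nearest codeword is the wrong one. The list decoder returns both candidates, and the tag selects the correct $w$.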

We consider two $\mathit{Tag}()$ functions, inspired by [Gur03, Lan04, MPSW05].

  1.

    Recall that for computationally bounded errors, the corrupted string $w'$ depends on both $w$ and $\mathsf{SS}(w)$, but $w'$ is computed by a probabilistic circuit of size polynomial in $n$.

    Consider $\mathit{Tag}(w) = \mathsf{hash}(w)$, where $\mathsf{hash}$ is drawn from a collision-resistant function family. More specifically, we will use some extra randomness $r$ to choose a key $\mathit{key}$ for a collision-resistant hash family. The output of the sketch is then

    $$\mathsf{SS}(w; x, r) = (\ w \oplus C(x),\ \mathit{key}(r),\ \mathsf{hash}_{\mathit{key}(r)}(w)\ )\,.$$

    If the list-decoding algorithm for the code $C$ runs in polynomial time, then the adversary succeeds only if he can find a value $w_i \ne w$ such that $\mathsf{hash}_{\mathit{key}}(w_i) = \mathsf{hash}_{\mathit{key}}(w)$, that is, only by finding a collision for the hash function. By assumption, a polynomially bounded adversary succeeds only with negligible probability.

    The additional entropy loss, beyond that of the code-offset part of the sketch, is bounded above by the output length of the hash function. If $\alpha$ is the desired bound on the adversary’s success probability, then for standard assumptions on hash functions this loss will be polynomial in $\log(1/\alpha)$.

    In principle this transformation can yield sketches which achieve the optimal entropy loss $n(h(t/n) - o(1))$, since codes with polynomial list size $L$ are known to exist for error rates approaching the Shannon bound. However, in order to use the construction the code must also be equipped with a reasonably efficient algorithm for finding such a list. This is necessary both so that recovery will be efficient and, more subtly, for the proof of security to go through (that way we can assume that the polynomial-time adversary knows the list of words generated during the recovery procedure). We do not know of efficient (i.e., polynomial-time constructible and decodable) binary list-decodable codes which meet the Shannon bound for all choices of parameters. However, when the error rate is near $\frac{1}{2}$ such codes are known [GS00]. Thus, this type of construction yields essentially optimal sketches when the error rate is near $1/2$. This is quite similar to analogous results on channel coding [MPSW05]. Relatively little is known about the performance of efficiently list-decodable codes in other parameter ranges for binary alphabets [Gur01].

  2.

    A similar, even simpler, transformation can be used in the setting of input-dependent errors (i.e., when the errors depend only on the input and not on the sketch, but the adversary is not assumed to be computationally bounded). One can store $\mathit{Tag}(w) = (I, h_I(w))$, where $\{h_i\}_{i \in \mathcal{I}}$ comes from a universal hash family mapping from $\mathcal{M}$ to $\{0,1\}^{\ell}$, where $\ell = \log\left(\frac{1}{\alpha}\right) + \log L$ and $\alpha$ is the probability of an incorrect decoding.

    The proof is simple: the values $w_1, \dots, w_L$ do not depend on $I$, and so for any value $w_i \ne w$, the probability that $h_I(w_i) = h_I(w)$ is $2^{-\ell}$. There are at most $L$ possible candidates, and so the probability that any one of the elements in the list is accepted is at most $L \cdot 2^{-\ell} = \alpha$. The additional entropy loss incurred is at most $\ell = \log\left(\frac{1}{\alpha}\right) + \log(L)$.

    In principle, this transformation can do as well as the randomization approach of the previous section. However, we do not know of efficient binary list-decodable codes meeting the Shannon bound for most parameter ranges. Thus, in general, randomizing the errors (as in the previous section) works better in the input-dependent setting.
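For the second transformation, the tag length and one possible universal hash family can be sketched as follows. The family used here, multiplication by a uniformly random $\ell \times n$ binary matrix over GF(2), is a standard example of a universal family and is an illustrative assumption, as are the function names; the text does not fix a particular family. Inputs $w$ are represented as $n$-bit Python integers.

```python
import math
import secrets

def tag_length(alpha: float, L: int) -> int:
    """ell = log(1/alpha) + log(L), rounded up to a whole number of bits."""
    return math.ceil(math.log2(1 / alpha) + math.log2(L))

def sample_index(ell: int, n: int) -> list:
    """Index I: a uniformly random ell x n binary matrix, one n-bit integer per row."""
    return [secrets.randbits(n) for _ in range(ell)]

def universal_hash(I: list, w: int) -> tuple:
    """h_I(w) = I * w over GF(2): each output bit is the parity of (row AND w)."""
    return tuple(bin(row & w).count("1") & 1 for row in I)
```

For instance, with list size $L = 16$ and target error $\alpha = 2^{-40}$, `tag_length` gives $\ell = 44$ bits of additional entropy loss. Because the map is linear, two distinct inputs collide exactly when their XOR is mapped to zero, which happens with probability $2^{-\ell}$ over the choice of $I$.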

9 Secure Sketches and Efficient Information Reconciliation

Suppose Alice holds a set $w$ and Bob holds a set $w'$ that are close to each other. They wish to reconcile the sets: to discover the symmetric difference $w \triangle w'$ so that they can take whatever appropriate (application-dependent) action to make their two sets agree. Moreover, they wish to do this communication-efficiently, without having to transmit entire sets to each other. This problem is known as set reconciliation and naturally arises in various settings.

Let $(\mathsf{SS}, \mathsf{Rec})$ be a secure sketch for set difference that can handle distance up to $t$; furthermore, suppose that $|w \triangle w'| \le t$. Then if Bob receives $s = \mathsf{SS}(w)$ from Alice, he will be able to recover $w$, and therefore $w \triangle w'$, from $s$ and $w'$. Similarly, Alice will be able to find $w \triangle w'$ upon receiving $s' = \mathsf{SS}(w')$ from Bob. This will be communication-efficient if $|s|$ is small. Note that our secure sketches for set difference of Sections 6.2 and 6.3 are indeed short; in fact, they are secure precisely because they are short. Thus, they also make good set reconciliation schemes.

Conversely, a good (single-message) set reconciliation scheme makes a good secure sketch: simply make the message the sketch. The entropy loss will be at most the length of the message, which is short in a communication-efficient scheme. Thus, the set reconciliation scheme CPISync of [MTZ03] makes a good secure sketch. In fact, it is quite similar to the secure sketch of Section 6.2, except instead of the top $t$ coefficients of the characteristic polynomial it uses the values of the polynomial at $t$ points.

PinSketch of Section 6.3, when used for set reconciliation, achieves the same parameters as CPISync of [MTZ03], except decoding is faster, because instead of spending $t^3$ time to solve a system of linear equations, it spends $t^2$ time for Euclid’s algorithm. Thus, it can be substituted wherever CPISync is used, such as PDA synchronization [STA03] and PGP key server updates [Min04]. Furthermore, optimizations that improve computational complexity of CPISync through the use of interaction [MT02] can also be applied to PinSketch.

Of course, secure sketches for other metrics are similarly related to information reconciliation for those metrics. In particular, ideas for edit distance very similar to ours were independently considered in the context of information reconciliation by [CT04].

Acknowledgments

This work evolved over several years and discussions with many people enriched our understanding of the material at hand. In roughly chronological order, we thank Piotr Indyk for discussions about embeddings and for his help in the proof of Lemma 7.3; Madhu Sudan, for helpful discussions about the construction of [JS06] and the uses of error-correcting codes; Venkat Guruswami, for enlightenment about list decoding; Pim Tuyls, for pointing out relevant previous work; Chris Peikert, for pointing out the model of computationally bounded adversaries from [MPSW05]; Ari Trachtenberg, for finding an error in the preliminary version of Appendix E; Ronny Roth, for discussions about efficient BCH decoding; Kevin Harmon and Soren Johnson, for their implementation work; and Silvio Micali and anonymous referees, for suggestions on presenting our results.

The work of Y.D. was partly funded by the National Science Foundation under CAREER Award No. CCR-0133806 and Trusted Computing Grant No. CCR-0311095, and by the New York University Research Challenge Fund 25-74100-N5237. The work of L.R. was partly funded by the National Science Foundation under Grant Nos. CCR-0311485, CCF-0515100 and CNS-0202067. The work of A.S. at MIT was partly funded by US A.R.O. grant DAAD19-00-1-0177 and by a Microsoft Fellowship. While at the Weizmann Institute, A.S. was supported by the Louis L. and Anita M. Perlman Postdoctoral Fellowship.

References

  • [AK07] Alexandr Andoni and Robi Krauthgamer. The computational hardness of estimating edit distance. In IEEE Symposium on the Foundations of Computer Science (FOCS), pages 724–734, 2007.
  • [AVZ00] Erik Agrell, Alexander Vardy, and Kenneth Zeger. Upper bounds for constant-weight codes. IEEE Transactions on Information Theory, 46(7):2373–2395, 2000.
  • [BBCM95] Charles H. Bennett, Gilles Brassard, Claude Crépeau, and Ueli M. Maurer. Generalized privacy amplification. IEEE Transactions on Information Theory, 41(6):1915–1923, 1995.
  • [BBCS91] Charles H. Bennett, Gilles Brassard, Claude Crépeau, and Marie-Hélène Skubiszewska. Practical quantum oblivious transfer. In J. Feigenbaum, editor, Advances in Cryptology—CRYPTO ’91, volume 576 of Lecture Notes in Computer Science, pages 351–366. Springer-Verlag, 1992, 11–15 August 1991.
  • [BBR88] C. Bennett, G. Brassard, and J. Robert. Privacy amplification by public discussion. SIAM Journal on Computing, 17(2):210–229, 1988.
  • [BCN04] C. Barral, J.-S. Coron, and D. Naccache. Externalized fingerprint matching. Technical Report 2004/021, Cryptology e-print archive, http://eprint.iacr.org, 2004.
  • [BDK+05] Xavier Boyen, Yevgeniy Dodis, Jonathan Katz, Rafail Ostrovsky, and Adam Smith. Secure remote authentication using biometric data. In Ronald Cramer, editor, Advances in Cryptology—EUROCRYPT 2005, volume 3494 of Lecture Notes in Computer Science, pages 147–163. Springer-Verlag, 2005.
  • [Bla83] Richard E. Blahut. Theory and practice of error control codes. Addison Wesley Longman, Reading, MA, 1983. 512 p.
  • [Boy04] Xavier Boyen. Reusable cryptographic fuzzy extractors. In Eleventh ACM Conference on Computer and Communication Security, pages 82–91. ACM, October 25–29 2004.
  • [Bro97] Andrei Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, Washington, DC, 1997. IEEE Computer Society.
  • [BSSS90] Andries E. Brouwer, James B. Shearer, Neil J. A. Sloane, and Warren D. Smith. A new table of constant weight codes. IEEE Transactions on Information Theory, 36(6):1334–1380, 1990.
  • [CFL06] Ee-Chien Chang, Vadym Fedyukovych, and Qiming Li. Secure sketch for multi-sets. Technical Report 2006/090, Cryptology e-print archive, http://eprint.iacr.org, 2006.
  • [CG88] Benny Chor and Oded Goldreich. Unbiased bits from sources of weak randomness and probabilistic communication complexity. SIAM Journal on Computing, 17(2):230–261, 1988.
  • [CK03] L. Csirmaz and G.O.H. Katona. Geometrical cryptography. In Proc. International Workshop on Coding and Cryptography, 2003.
  • [CL06] Ee-Chien Chang and Qiming Li. Hiding secret points amidst chaff. In Serge Vaudenay, editor, Advances in Cryptology—EUROCRYPT 2006, volume 4004 of Lecture Notes in Computer Science, pages 59–72. Springer-Verlag, 2006.
  • [Cré97] Claude Crépeau. Efficient cryptographic protocols based on noisy channels. In Walter Fumy, editor, Advances in Cryptology—EUROCRYPT 97, volume 1233 of Lecture Notes in Computer Science, pages 306–317. Springer-Verlag, 11–15 May 1997.
  • [CT04] V. Chauhan and A. Trachtenberg. Reconciliation puzzles. In IEEE Globecom, Dallas, TX, pages 600–604, 2004.
  • [CW79] J.L. Carter and M.N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143–154, 1979.
  • [CZ04] Gérard Cohen and Gilles Zémor. Generalized coset schemes for the wire-tap channel: Application to biometrics. In IEEE International Symp. on Information Theory, page 45, 2004.
  • [DFMP99] G.I. Davida, Y. Frankel, B.J. Matt, and R. Peralta. On the relation of error correction and cryptography to an off line biometric based identification scheme. In Proceedings of WCC99, Workshop on Coding and Cryptography, Paris, France, 11-14 January 1999. Available at http://citeseer.ist.psu.edu/389295.html.
  • [DGL04] Yan Zhong Ding, P. Gopalan, and Richard J. Lipton. Error correction against computationally bounded adversaries. Manuscript. Appeared initially as [Lip94]; to appear in Theory of Computing Systems, 2004.
  • [Din05] Yan Zong Ding. Error correction in the bounded storage model. In Joe Kilian, editor, TCC, volume 3378 of Lecture Notes in Computer Science, pages 578–599. Springer, 2005.
  • [DKRS06] Yevgeniy Dodis, Jonathan Katz, Leonid Reyzin, and Adam Smith. Robust fuzzy extractors and authenticated key agreement from close secrets. In Cynthia Dwork, editor, Advances in Cryptology—CRYPTO 2006, volume 4117 of Lecture Notes in Computer Science, pages 232–250. Springer-Verlag, 20–24 August 2006.
  • [DORS06] Yevgeniy Dodis, Rafail Ostrovsky, Leonid Reyzin, and Adam Smith. Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. Technical Report 2003/235, Cryptology ePrint archive, http://eprint.iacr.org, 2006. Previous version appeared at EUROCRYPT 2004.
  • [DRS04] Yevgeniy Dodis, Leonid Reyzin, and Adam Smith. Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. In Christian Cachin and Jan Camenisch, editors, Advances in Cryptology—EUROCRYPT 2004, volume 3027 of Lecture Notes in Computer Science, pages 79–100. Springer-Verlag, 2004.
  • [DRS07] Yevgeniy Dodis, Leonid Reyzin, and Adam Smith. Fuzzy extractors. In Security with Noisy Data, 2007.
  • [DS05] Yevgeniy Dodis and Adam Smith. Correcting errors without leaking partial information. In Harold N. Gabow and Ronald Fagin, editors, STOC, pages 654–663. ACM, 2005.
  • [EHMS00] Carl Ellison, Chris Hall, Randy Milbert, and Bruce Schneier. Protecting keys with personal entropy. Future Generation Computer Systems, 16:311–318, February 2000.
  • [FJ01] Niklas Frykholm and Ari Juels. Error-tolerant password recovery. In Eighth ACM Conference on Computer and Communication Security, pages 1–8. ACM, November 5–8 2001.
  • [For66] G. David Forney. Concatenated Codes. PhD thesis, MIT, 1966.
  • [Fry00] N. Frykholm. Passwords: Beyond the terminal interaction model. Master’s thesis, Umeå University, 2000.
  • [GR06] Venkatesan Guruswami and Atri Rudra. Explicit capacity-achieving list-decodable codes. In Jon M. Kleinberg, editor, STOC, pages 1–10. ACM, 2006.
  • [GS00] Venkatesan Guruswami and Madhu Sudan. List decoding algorithms for certain concatenated codes. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pages 181–190, Portland, Oregon, 21–23 May 2000.
  • [Gur01] V. Guruswami. List Decoding of Error-Correcting Codes. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001.
  • [Gur03] Venkatesan Guruswami. List decoding with side information. In IEEE Conference on Computational Complexity, pages 300–. IEEE Computer Society, 2003.
  • [HILL99] J. Håstad, R. Impagliazzo, L.A. Levin, and M. Luby. A pseudorandom generator from any one-way function. SIAM Journal on Computing, 28(4):1364–1396, 1999.
  • [HJR06] Kevin Harmon, Soren Johnson, and Leonid Reyzin. An implementation of syndrome encoding and decoding for binary BCH codes, secure sketches and fuzzy extractors, 2006. Available at http://www.cs.bu.edu/~reyzin/code/fuzzy.html.
  • [JS06] Ari Juels and Madhu Sudan. A fuzzy vault scheme. Designs, Codes and Cryptography, 38(2):237–257, 2006.
  • [JW99] Ari Juels and Martin Wattenberg. A fuzzy commitment scheme. In Tsudik [Tsu99], pages 28–36.
  • [KO63] A.A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
  • [KS95] E. Kaltofen and V. Shoup. Subquadratic-time factoring of polynomials over finite fields. In Proceedings of the Twenty-Seventh Annual ACM Symposium on the Theory of Computing, pages 398–406, Las Vegas, Nevada, 29 May–1 June 1995.
  • [KSHW97] John Kelsey, Bruce Schneier, Chris Hall, and David Wagner. Secure applications of low-entropy keys. In Eiji Okamoto, George I. Davida, and Masahiro Mambo, editors, ISW, volume 1396 of Lecture Notes in Computer Science, pages 121–134. Springer, 1997.
  • [Lan04] Michael Langberg. Private codes or succinct random codes that are (almost) perfect. In FOCS ’04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS’04), pages 325–334, Washington, DC, USA, 2004. IEEE Computer Society.
  • [Lip94] Richard J. Lipton. A new approach to information theory. In Patrice Enjalbert, Ernst W. Mayr, and Klaus W. Wagner, editors, STACS, volume 775 of Lecture Notes in Computer Science, pages 699–708. Springer, 1994. The full version of this paper is in preparation [DGL04].
  • [LSM06] Qiming Li, Yagiz Sutcu, and Nasir Memon. Secure sketch for biometric templates. In Advances in Cryptology—ASIACRYPT 2006, volume 4284 of Lecture Notes in Computer Science, pages 99–113, Shanghai, China, 3–7 December 2006. Springer-Verlag.
  • [LT03] J.-P. M. G. Linnartz and P. Tuyls. New shielding functions to enhance privacy and prevent misuse of biometric templates. In AVBPA, pages 393–402, 2003.
  • [Mau93] Ueli Maurer. Secret key agreement by public discussion from common information. IEEE Transactions on Information Theory, 39(3):733–742, 1993.
  • [Min04] Yaron Minsky. The SKS OpenPGP key server v1.0.5, March 2004. http://www.nongnu.org/sks.
  • [MPSW05] Silvio Micali, Chris Peikert, Madhu Sudan, and David Wilson. Optimal error correction against computationally bounded noise. In Joe Kilian, editor, First Theory of Cryptography Conference — TCC 2005, volume 3378 of Lecture Notes in Computer Science, pages 1–16. Springer-Verlag, February 10–12 2005.
  • [MRLW01a] Fabian Monrose, Michael K. Reiter, Qi Li, and Susanne Wetzel. Cryptographic key generation from voice. In Martin Abadi and Roger Needham, editors, IEEE Symposium on Security and Privacy, pages 202–213, 2001.
  • [MRLW01b] Fabian Monrose, Michael K. Reiter, Qi Li, and Susanne Wetzel. Using voice to generate cryptographic keys. In 2001: A Speaker Odyssey. The Speaker Recognition Workshop, pages 237–242, Crete, Greece, 2001.
  • [MRW99] Fabian Monrose, Michael K. Reiter, and Susanne Wetzel. Password hardening based on keystroke dynamics. In Tsudik [Tsu99], pages 73–82.
  • [MT79] Robert Morris and Ken Thompson. Password security: A case history. Communications of the ACM, 22(11):594–597, 1979.
  • [MT02] Yaron Minsky and Ari Trachtenberg. Scalable set reconciliation. In 40th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, pages 1607–1616, October 2002. See also technical report BU-ECE-2002-01.
  • [MTZ03] Yaron Minsky, Ari Trachtenberg, and Richard Zippel. Set reconciliation with nearly optimal communication complexity. IEEE Transactions on Information Theory, 49(9):2213–2218, 2003.
  • [NZ96] Noam Nisan and David Zuckerman. Randomness is linear in space. Journal of Computer and System Sciences, 52(1):43–53, 1996.
  • [OR05] Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 218–224, Baltimore, Maryland, 22–24 May 2005.
  • [PV05] Farzad Parvaresh and Alexander Vardy. Correcting errors beyond the Guruswami-Sudan radius in polynomial time. In FOCS, pages 285–294. IEEE Computer Society, 2005.
  • [Rey07] Leonid Reyzin. Entropy Loss is Maximal for Uniform Inputs. Technical Report BUCS-TR-2007-011, CS Department, Boston University, 2007. Available from http://www.cs.bu.edu/techreports/.
  • [RTS00] Jaikumar Radhakrishnan and Amnon Ta-Shma. Bounds for dispersers, extractors, and depth-two superconcentrators. SIAM Journal on Discrete Mathematics, 13(1):2–24, 2000.
  • [RW04] Renato Renner and Stefan Wolf. Smooth Rényi entropy and applications. In Proceedings of IEEE International Symposium on Information Theory, page 233, June 2004.
  • [RW05] Renato Renner and Stefan Wolf. Simple and tight bounds for information reconciliation and privacy amplification. In Bimal Roy, editor, Advances in Cryptology—ASIACRYPT 2005, Lecture Notes in Computer Science, pages 199–216, Chennai, India, 4–8 December 2005. Springer-Verlag.
  • [Sha48] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, July and October 1948. Reprinted in D. Slepian, editor, Key Papers in the Development of Information Theory, IEEE Press, NY, 1974.
  • [Sha02] Ronen Shaltiel. Recent developments in explicit constructions of extractors. Bulletin of the EATCS, 77:67–95, 2002.
  • [Sho01] Victor Shoup. A proposal for an ISO standard for public key encryption. Available at http://eprint.iacr.org/2001/112, 2001.
  • [Sho05] Victor Shoup. A Computational Introduction to Number Theory and Algebra. Cambridge University Press, 2005. Available from http://shoup.net.
  • [SKHN75] Yasuo Sugiyama, Masao Kasahara, Shigeichi Hirasawa, and Toshihiko Namekawa. A method for solving key equation for decoding Goppa codes. Information and Control, 27(1):87–99, 1975.
  • [Smi07] Adam Smith. Scrambling adversarial errors using few random bits. In H. Gabow, editor, ACM–SIAM Symposium on Discrete Algorithms (SODA), 2007.
  • [STA03] David Starobinski, Ari Trachtenberg, and Sachin Agarwal. Efficient PDA synchronization. IEEE Transactions on Mobile Computing, 2(1):40–51, 2003.
  • [Sud01] Madhu Sudan. Lecture notes for an algorithmic introduction to coding theory. Course taught at MIT, December 2001.
  • [TG04] Pim Tuyls and Jasper Goseling. Capacity and examples of template-protecting biometric authentication systems. In Davide Maltoni and Anil K. Jain, editors, ECCV Workshop BioAW, volume 3087 of Lecture Notes in Computer Science, pages 158–170. Springer, 2004.
  • [Tsu99] Gene Tsudik, editor. Sixth ACM Conference on Computer and Communication Security. ACM, November 1999.
  • [vL92] J.H. van Lint. Introduction to Coding Theory. Springer-Verlag, 1992.
  • [VTDL03] E. Verbitskiy, P. Tuyls, D. Denteneer, and J.-P. Linnartz. Reliable biometric authentication with privacy protection. In Proc. 24th Benelux Symposium on Information theory. Society for Information Theory in the Benelux, 2003.
  • [vzGG03] Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, 2003.
  • [WC81] M.N. Wegman and J.L. Carter. New hash functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22:265–279, 1981.

Appendix A Proof of Lemma 2.2

Recall that Lemma 2.2 considered random variables $A, B, C$ and consisted of two parts, which we prove one after the other.

Part (a) stated that for any $\delta>0$, the conditional entropy $\mathbf{H}_\infty(A \mid B=b)$ is at least $\tilde{\mathbf{H}}_\infty(A \mid B)-\log(1/\delta)$ with probability at least $1-\delta$ (the probability here is taken over the choice of $b$). Let $p=2^{-\tilde{\mathbf{H}}_\infty(A\mid B)}=\mathbb{E}_{b}\left[2^{-\mathbf{H}_\infty(A\mid B=b)}\right]$. By the Markov inequality, $2^{-\mathbf{H}_\infty(A\mid B=b)}\leq p/\delta$ with probability at least $1-\delta$. Taking logarithms, part (a) follows.
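The statement of part (a) can be checked numerically on a small example. The Python sketch below is only an illustration under stated assumptions: the joint distribution is given as a dictionary with strictly positive probabilities, and the helper names `avg_min_entropy` and `good_mass` are hypothetical. It computes the average min-entropy and the probability mass of those $b$ for which the conditional min-entropy meets the bound.

```python
import math
from collections import defaultdict

def avg_min_entropy(joint):
    """Average min-entropy of A given B, from a dict {(a, b): probability}.

    Uses the identity 2^{-H~(A|B)} = sum_b max_a Pr[A=a and B=b].
    """
    best = defaultdict(float)
    for (a, b), p in joint.items():
        best[b] = max(best[b], p)
    return -math.log2(sum(best.values()))

def good_mass(joint, delta):
    """Mass of the values b with H_inf(A | B=b) >= H~(A|B) - log2(1/delta)."""
    threshold = avg_min_entropy(joint) - math.log2(1.0 / delta)
    marginal = defaultdict(float)
    best = defaultdict(float)
    for (a, b), p in joint.items():
        marginal[b] += p
        best[b] = max(best[b], p)
    mass = 0.0
    for b, pb in marginal.items():
        h_b = -math.log2(best[b] / pb)   # min-entropy of A conditioned on B=b
        if h_b >= threshold - 1e-12:     # small tolerance for rounding
            mass += pb
    return mass

# A small joint distribution of (A, B), chosen arbitrarily for illustration.
joint = {(0, 0): 0.4, (1, 0): 0.1,
         (0, 1): 0.05, (1, 1): 0.05, (2, 1): 0.05, (3, 1): 0.05,
         (0, 2): 0.3}
for delta in (0.5, 0.9, 0.99):
    assert good_mass(joint, delta) >= 1 - delta
```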

Part (b) stated that if $B$ has at most $2^{\lambda}$ possible values, then $\tilde{\mathbf{H}}_\infty(A\mid(B,C))\geq\tilde{\mathbf{H}}_\infty((A,B)\mid C)-\lambda\geq\tilde{\mathbf{H}}_\infty(A\mid C)-\lambda$. In particular, $\tilde{\mathbf{H}}_\infty(A\mid B)\geq\mathbf{H}_\infty((A,B))-\lambda\geq\mathbf{H}_\infty(A)-\lambda$. Clearly, it suffices to prove the first assertion (the second follows from taking $C$ to be constant). Moreover, the second inequality of the first assertion follows from the fact that $\Pr[A=a\wedge B=b\mid C=c]\leq\Pr[A=a\mid C=c]$, for any $c$. Thus, we prove only that $\tilde{\mathbf{H}}_\infty(A\mid(B,C))\geq\tilde{\mathbf{H}}_\infty((A,B)\mid C)-\lambda$:

\begin{align*}
\tilde{\mathbf{H}}_\infty(A\mid(B,C)) &= -\log\mathbb{E}_{(b,c)\leftarrow(B,C)}\left[\max_{a}\Pr[A=a\mid B=b\wedge C=c]\right] \\
&= -\log\sum_{(b,c)}\max_{a}\Pr[A=a\mid B=b\wedge C=c]\Pr[B=b\wedge C=c] \\
&= -\log\sum_{(b,c)}\max_{a}\Pr[A=a\wedge B=b\mid C=c]\Pr[C=c] \\
&= -\log\sum_{b}\mathbb{E}_{c\leftarrow C}\left[\max_{a}\Pr[A=a\wedge B=b\mid C=c]\right] \\
&\geq -\log\sum_{b}\mathbb{E}_{c\leftarrow C}\left[\max_{a,b'}\Pr[A=a\wedge B=b'\mid C=c]\right] \\
&= -\log\sum_{b}2^{-\tilde{\mathbf{H}}_\infty((A,B)\mid C)} \geq -\log 2^{\lambda}2^{-\tilde{\mathbf{H}}_\infty((A,B)\mid C)} = \tilde{\mathbf{H}}_\infty((A,B)\mid C)-\lambda\,.
\end{align*}

The first inequality in the above derivation holds since taking the maximum over all pairs $(a,b')$ (instead of over pairs $(a,b)$ where $b$ is fixed) increases the terms of the sum and hence decreases the negative log of the sum.
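The chain of inequalities in part (b) can likewise be sanity-checked on a randomly generated joint distribution. The following Python sketch is an illustration only: the helper `avg_min_entropy`, the index convention (positions 0, 1, 2 for $A$, $B$, $C$), and the support sizes are all assumptions made here for the example.

```python
import math
import random
from collections import defaultdict

def avg_min_entropy(joint, target, given):
    """Average min-entropy of the `target` coordinates given the `given` ones.

    joint  : dict mapping outcome tuples (a, b, c) to probabilities
    target : tuple of coordinate indices being guessed
    given  : tuple of coordinate indices known to the adversary
    """
    marg = defaultdict(float)
    for outcome, p in joint.items():
        t = tuple(outcome[i] for i in target)
        g = tuple(outcome[i] for i in given)
        marg[(t, g)] += p                     # marginalize unused coordinates
    best = defaultdict(float)
    for (t, g), p in marg.items():
        best[g] = max(best[g], p)             # best guess for each known value
    return -math.log2(sum(best.values()))

# Random joint distribution of (A, B, C) with |B| = 2, so lambda = 1.
rng = random.Random(0)
weights = {(a, b, c): rng.random()
           for a in range(4) for b in range(2) for c in range(3)}
total = sum(weights.values())
joint = {k: v / total for k, v in weights.items()}
lam = math.log2(2)

h_a_given_bc = avg_min_entropy(joint, (0,), (1, 2))
h_ab_given_c = avg_min_entropy(joint, (0, 1), (2,))
h_a_given_c = avg_min_entropy(joint, (0,), (2,))
assert h_a_given_bc >= h_ab_given_c - lam - 1e-9
assert h_ab_given_c >= h_a_given_c - 1e-9
```

A check of this kind does not replace the proof; it only confirms that the quantities behave as the derivation predicts on one example.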

Appendix B On Smooth Variants of Average Min-Entropy and the Relationship to Smooth Rényi Entropy

Min-entropy is a rather fragile measure: a single high-probability element can ruin the min-entropy of an otherwise good distribution. This is often circumvented within proofs by considering a distribution which is close to the distribution of interest, but which has higher entropy. Renner and Wolf [RW04] systematized this approach with the notion of $\epsilon$-smooth min-entropy (they use the term “Rényi entropy of order $\infty$” instead of “min-entropy”), which considers all distributions that are $\epsilon$-close:

$$\mathbf{H}_\infty^{\epsilon}(A)=\max_{B:\ \mathbf{SD}(A,B)\leq\epsilon}\mathbf{H}_\infty(B)\,.$$

Smooth min-entropy very closely relates to the amount of extractable nearly uniform randomness: if one can map $A$ to a distribution that is $\epsilon$-close to $U_m$, then $\mathbf{H}_\infty^{\epsilon}(A)\geq m$; conversely, from any $A$ such that $\mathbf{H}_\infty^{\epsilon}(A)\geq m$, and for any $\epsilon_2$, one can extract $m-2\log\left(\frac{1}{\epsilon_2}\right)$ bits that are $(\epsilon+\epsilon_2)$-close to uniform (see [RW04] for a more precise statement; the proof of the first statement follows by considering the inverse map, and the proof of the second from the leftover hash lemma, which is discussed in more detail in Lemma 2.4). For some distributions, considering the smooth min-entropy will improve the number and quality of extractable random bits.
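To make the effect of smoothing concrete, the Python sketch below computes $\epsilon$-smooth min-entropy for a finite distribution. It rests on assumptions that are not spelled out in the text: the universe is fixed and finite (the length of the input list), the nearby distribution $B$ is a normalized distribution on the same universe, and closeness is statistical distance. Under those assumptions the best $B$ caps the largest probabilities at a common level and moves at most $\epsilon$ of mass onto lighter elements. The function name `smooth_min_entropy` is hypothetical, and the precise definition in [RW04] may differ in details.

```python
import math

def smooth_min_entropy(probs, eps):
    """Smooth min-entropy (in bits) of a distribution on a fixed finite universe.

    Finds the cap tau with sum_i max(p_i - tau, 0) = eps, i.e. the lowest
    maximum probability reachable by moving at most eps of mass, and never
    goes below the uniform level 1/n.
    """
    n = len(probs)
    p = sorted(probs, reverse=True)
    tau = 1.0 / n
    prefix = 0.0
    for k in range(1, n + 1):
        prefix += p[k - 1]
        cand = (prefix - eps) / k           # cap if exactly the top k are trimmed
        nxt = p[k] if k < n else 0.0
        if cand >= nxt:                     # the remaining elements are untouched
            tau = cand
            break
    tau = max(tau, 1.0 / n)                 # cannot beat the uniform distribution
    return -math.log2(tau)

# One heavy element ruins plain min-entropy; a little smoothing repairs it.
probs = [0.25] + [0.75 / 255] * 255
plain = smooth_min_entropy(probs, 0.0)
smoothed = smooth_min_entropy(probs, 0.125)
```

In this example the single element of probability $1/4$ limits the plain min-entropy, while allowing statistical distance $1/8$ lowers the largest probability to $1/8$ and so gains one bit.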

A smooth version of average min-entropy can also be considered, defined as

$$\tilde{\mathbf{H}}_\infty^{\epsilon}(A\mid B)=\max_{(C,D):\ \mathbf{SD}((A,B),(C,D))\leq\epsilon}\tilde{\mathbf{H}}_\infty(C\mid D)\,.$$

It similarly relates very closely to the number of extractable bits that look nearly uniform to the adversary who knows the value of $B$, and is therefore perhaps a better measure for the quality of a secure sketch that is used to obtain a fuzzy extractor. All our results can be cast in terms of smooth entropies throughout, with appropriate modifications (if input entropy is $\epsilon$-smooth, then output entropy will also be $\epsilon$-smooth, and extracted random strings will be $\epsilon$ further away from uniform). We avoid doing so for simplicity of exposition. However, for some input distributions, particularly ones with few elements of relatively high probability, this will improve the result by giving more secure sketches or longer-output fuzzy extractors.

Finally, a word is in order on the relation of average min-entropy to conditional min-entropy, introduced by Renner and Wolf in [RW05], and defined as 𝐇(AB)=logmaxa,bPr(A=aB=b)=minb𝐇(AB=b)subscript𝐇conditional𝐴𝐵subscript𝑎𝑏Pr𝐴conditional𝑎𝐵𝑏subscript𝑏subscript𝐇conditional𝐴𝐵𝑏{\mathbf{H}_{\infty}}(A\mid B)=-\log\max_{a,b}\Pr(A=a\mid B=b)=\min_{b}{\mathbf{H}_{\infty}}(A\mid B=b)bold_H start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_A ∣ italic_B ) = - roman_log roman_max start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT roman_Pr ( italic_A = italic_a ∣ italic_B = italic_b ) = roman_min start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT bold_H start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_A ∣ italic_B = italic_b ) (an ϵitalic-ϵ\epsilonitalic_ϵ-smooth version is defined analogously by considering all distributions (C,D)𝐶𝐷(C,D)( italic_C , italic_D ) that are within ϵitalic-ϵ\epsilonitalic_ϵ of (A,B)𝐴𝐵(A,B)( italic_A , italic_B ) and taking the maximum among them). This definition is too strict: it takes the worst-case b𝑏bitalic_b, while for randomness extraction (and many other settings, such as predictability by an adversary), average-case b𝑏bitalic_b suffices. Average min-entropy leads to more extractable bits. 
Nevertheless, after smoothing the two notions are equivalent up to an additive $\log\left(\frac{1}{\epsilon}\right)$ term: $\tilde{\mathbf{H}}_\infty^{\epsilon}(A\mid B)\geq\mathbf{H}_\infty^{\epsilon}(A\mid B)$ and $\mathbf{H}_\infty^{\epsilon+\epsilon_2}(A\mid B)\geq\tilde{\mathbf{H}}_\infty^{\epsilon}(A\mid B)-\log\left(\frac{1}{\epsilon_2}\right)$ (for the case of $\epsilon=0$, this follows by constructing a new distribution that eliminates all $b$ for which $\mathbf{H}_\infty(A\mid B=b)<\tilde{\mathbf{H}}_\infty(A\mid B)-\log\left(\frac{1}{\epsilon_2}\right)$, which will be within $\epsilon_2$ of $(A,B)$ by Markov’s inequality; for $\epsilon>0$, an analogous proof works). Note that by Lemma 2.2(b), this implies a simple chain rule for $\mathbf{H}_\infty^{\epsilon}$ (a more general one is given in [RW05, Section 2.4]): $\mathbf{H}_\infty^{\epsilon+\epsilon_2}(A\mid B)\geq\tilde{\mathbf{H}}_\infty^{\epsilon}((A,B))-H_0(B)-\log\left(\frac{1}{\epsilon_2}\right)$, where $H_0(B)$ is the logarithm of the number of possible values of $B$.
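
As a small numerical illustration of why the worst-case definition is stricter, the following sketch computes both quantities for a toy joint distribution. It is not taken from the paper: the function names and the example distribution are hypothetical, and average min-entropy is computed as $-\log\sum_b\max_a\Pr(A=a,B=b)$, which is the definition used earlier in the paper rewritten in terms of joint probabilities.

```python
import math

def worst_case_min_entropy(joint):
    """H_inf(A | B) = -log2 max_{a,b} Pr(A=a | B=b); joint maps (a, b) -> probability."""
    marginal_b = {}
    for (_, b), p in joint.items():
        marginal_b[b] = marginal_b.get(b, 0.0) + p
    worst = max(p / marginal_b[b] for (_, b), p in joint.items() if p > 0)
    return -math.log2(worst)

def average_min_entropy(joint):
    """H~_inf(A | B) = -log2 sum_b max_a Pr(A=a, B=b)."""
    best_per_b = {}
    for (_, b), p in joint.items():
        best_per_b[b] = max(best_per_b.get(b, 0.0), p)
    return -math.log2(sum(best_per_b.values()))

# Hypothetical example: A is uniform on {0,...,7}; B = 1 exactly when A = 0.
# The rare event B = 1 reveals A completely, which dominates the worst case
# but contributes little on average.
toy_joint = {(a, 1 if a == 0 else 0): 1 / 8 for a in range(8)}
print(worst_case_min_entropy(toy_joint), average_min_entropy(toy_joint))
```

In this toy case the worst-case notion reports no remaining entropy, while the average-case notion still credits two bits, which is the gap that smoothing closes up to the additive logarithmic term.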

Appendix C Lower Bounds from Coding

Recall that an $(\mathcal{M},K,t)$ code is a subset of $K$ points of the metric space $\mathcal{M}$ which can correct $t$ errors (this is slightly different from the usual notation of the coding theory literature).

Let $K(\mathcal{M},t)$ be the largest $K$ for which there exists an $(\mathcal{M},K,t)$-code. Given any set $S$ of $2^m$ points in $\mathcal{M}$, we let $K(\mathcal{M},t,S)$ be the largest $K$ such that there exists an $(\mathcal{M},K,t)$-code all of whose $K$ points belong to $S$. Finally, we let $L(\mathcal{M},t,m)=\log(\min_{|S|=2^m}K(\mathcal{M},t,S))$. Of course, when $m=\log|\mathcal{M}|$, we get $L(\mathcal{M},t,m)=\log K(\mathcal{M},t)$. The exact determination of quantities $K(\mathcal{M},t)$ and $K(\mathcal{M},t,S)$ is a central problem of coding theory and is typically very hard. To the best of our knowledge, the quantity $L(\mathcal{M},t,m)$ was not explicitly studied in any of the three metrics that we study, and its exact determination seems hard as well.
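
To make these quantities concrete, the following brute-force sketch evaluates $K(\mathcal{M},t,S)$ and $L(\mathcal{M},t,m)$ for tiny binary Hamming spaces. It is an illustration only: the function names are hypothetical, points of $\{0,1\}^n$ are encoded as integers, and "corrects $t$ errors" is taken to mean pairwise Hamming distance at least $2t+1$, which is equivalent in the Hamming metric. The exhaustive search is exponential and is only usable for very small $n$.

```python
from itertools import combinations
from math import log2

def hamming(u, v):
    """Hamming distance between two bit strings encoded as integers."""
    return bin(u ^ v).count("1")

def largest_code(points, t):
    """K(M, t, S): size of the largest subset of `points` whose pairwise
    Hamming distance is at least 2t + 1 (found by backtracking)."""
    points = list(points)
    best = 0

    def extend(chosen, start):
        nonlocal best
        best = max(best, len(chosen))
        for i in range(start, len(points)):
            # Prune: even taking every remaining point cannot beat `best`.
            if len(chosen) + (len(points) - i) <= best:
                return
            if all(hamming(points[i], c) >= 2 * t + 1 for c in chosen):
                extend(chosen + [points[i]], i + 1)

    extend([], 0)
    return best

def L(n, t, m):
    """L({0,1}^n, t, m) = log2 of the minimum of K over all S with |S| = 2^m."""
    space = range(2 ** n)
    return log2(min(largest_code(S, t) for S in combinations(space, 2 ** m)))
```

For example, `largest_code(range(2**3), 1)` searches the whole cube $\{0,1\}^3$, while `L(3, 1, 2)` minimizes over all four-point subsets; a subset with no pair at distance 3 admits only a one-point code, so the minimum can be much smaller than the unrestricted optimum.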

We give two simple lower bounds on the entropy loss (one for secure sketches, the other for fuzzy extractors) which show that our constructions for the Hamming and set difference metrics output as much entropy $m'$ as possible when the original input distribution is uniform. In particular, because the constructions have the same entropy loss regardless of $m$, they are optimal in terms of the entropy loss $m-m'$. We conjecture that the constructions also have the highest possible value $m'$ for all values of $m$, but we do not have a good enough understanding of $L(\mathcal{M},t,m)$ (where $\mathcal{M}$ is the Hamming metric) to substantiate the conjecture.

Lemma C.1.

The existence of an $(\mathcal{M},m,m',t)$ secure sketch implies that $m'\leq L(\mathcal{M},t,m)$. In particular, when $m=\log|\mathcal{M}|$ (i.e., when the password is truly uniform), $m'\leq\log K(\mathcal{M},t)$.

Proof.

Assume $\mathsf{SS}$ is such a secure sketch. Let $S$ be any set of size $2^m$ in $\mathcal{M}$, and let $W$ be uniform over $S$. Then we must have $\tilde{\mathbf{H}}_\infty(W\mid\mathsf{SS}(W))\geq m'$. In particular, there must be some value $v$ such that $\mathbf{H}_\infty(W\mid\mathsf{SS}(W)=v)\geq m'$. But this means that conditioned on $\mathsf{SS}(W)=v$, there are at least $2^{m'}$ points $w$ in $S$ (call this set $T$) which could produce $\mathsf{SS}(W)=v$. We claim that these $2^{m'}$ values of $w$ form a code of error-correcting distance $t$. Indeed, otherwise there would be a point $w'\in\mathcal{M}$ such that $\mathsf{dis}(w_0,w')\leq t$ and $\mathsf{dis}(w_1,w')\leq t$ for some distinct $w_0,w_1\in T$. But then we must have that $\mathsf{Rec}(w',v)$ is equal to both $w_0$ and $w_1$, which is impossible. Thus, the set $T$ above must form an $(\mathcal{M},2^{m'},t)$-code inside $S$, which means that $m'\leq\log K(\mathcal{M},t,S)$. Since $S$ was arbitrary, the bound follows. ∎
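
The argument can be observed on a toy instance. The sketch below is an illustration and not part of the proof: it assumes a hypothetical sketch function that maps a 7-bit word to its syndrome under the $[7,4]$ binary Hamming code (parity-check column $j$ is the binary expansion of $j$), and checks that the preimage set $T$ of every sketch value is a set of 16 points at pairwise distance at least 3, that is, a code correcting $t=1$ error.

```python
from itertools import combinations, product

def hamming_syndrome(w):
    """Syndrome of a 7-bit word under the [7,4] Hamming code: XOR of the
    (1-based) positions holding a 1. Used here as a toy sketch SS(w)."""
    s = 0
    for position, bit in enumerate(w, start=1):
        if bit:
            s ^= position
    return s

def distance(u, v):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(u, v))

def preimage_sets():
    """Group all 7-bit words by sketch value v; each group plays the role of T."""
    groups = {}
    for w in product((0, 1), repeat=7):
        groups.setdefault(hamming_syndrome(w), []).append(w)
    return groups

def min_pairwise_distance(T):
    return min(distance(a, b) for a, b in combinations(T, 2))
```

Here $W$ is uniform on $\{0,1\}^7$, every $T$ has $2^4$ elements, and $16$ is also the sphere-packing limit $2^7/(1+7)$ for single-error correction, so this toy sketch meets the bound $m'\leq\log K(\mathcal{M},t)$ with equality.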

Lemma C.2.

The existence of $(\mathcal{M},m,\ell,t,\epsilon)$-fuzzy extractors implies that $\ell\leq L(\mathcal{M},t,m)-\log(1-\epsilon)$. In particular, when $m=\log|\mathcal{M}|$ (i.e., when the password is truly uniform), $\ell\leq\log K(\mathcal{M},t)-\log(1-\epsilon)$.

Proof.

Assume $(\mathsf{Gen},\mathsf{Rep})$ is such a fuzzy extractor. Let $S$ be any set of size $2^m$ in $\mathcal{M}$, let $W$ be uniform over $S$ and let $(R,P)\leftarrow\mathsf{Gen}(W)$. Then we must have $\mathbf{SD}\left((R,P),(U_\ell,P)\right)\leq\epsilon$. In particular, there must be some value $p$ of $P$ such that $R$ is $\epsilon$-close to $U_\ell$ conditioned on $P=p$. In particular, this means that conditioned on $P=p$, there are at least $(1-\epsilon)2^{\ell}$ points $r\in\{0,1\}^{\ell}$ (call this set $T$) which could be extracted with $P=p$. Now, map every $r\in T$ to some arbitrary $w\in S$ which could have produced $r$ with nonzero probability given $P=p$, and call this map $C$. $C$ must define a code with error-correcting distance $t$ by the same reasoning as in Lemma C.1. ∎

Observe that, as long as $\epsilon<1/2$, we have $0<-\log(1-\epsilon)<1$, so the lower bounds on secure sketches and fuzzy extractors differ by less than a bit.

Appendix D Analysis of the Original Juels-Sudan Construction

In this section we present a new analysis for the Juels-Sudan secure sketch for set difference. We will assume that $n=|\mathcal{U}|$ is a prime power and work over the field $\mathcal{F}=\mathit{GF}(n)$. On input set $w$, the original Juels-Sudan sketch is a list of $r$ pairs of points $(x_i,y_i)$ in $\mathcal{F}$, for some parameter $r$, $s<r\leq n$. It is computed as follows:

Construction 10 (Original Juels-Sudan Secure Sketch [JS06]).

Input: a set $w\subseteq\mathcal{F}$ of size $s$ and parameters $r\in\{s+1,\dots,n\}$, $t\in\{1,\dots,s\}$

  1. Choose $p()$ at random from the set of polynomials of degree at most $k=s-t-1$ over $\mathcal{F}$. Write $w=\{x_1,\dots,x_s\}$, and let $y_i=p(x_i)$ for $i=1,\dots,s$.

  2. Choose $r-s$ distinct points $x_{s+1},\dots,x_r$ at random from $\mathcal{F}-w$.

  3. For $i=s+1,\dots,r$, choose $y_i\in\mathcal{F}$ at random such that $y_i\neq p(x_i)$.

  4. Output $\mathsf{SS}(w)=\{(x_1,y_1),\dots,(x_r,y_r)\}$ (in lexicographic order of $x_i$).
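
The four steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it restricts $n$ to a prime so that $\mathit{GF}(n)$ is just arithmetic modulo $n$ (the construction allows any prime power), the names `eval_poly` and `juels_sudan_sketch` are hypothetical, and Python's `random` module stands in for a cryptographically secure source of randomness.

```python
import random

def eval_poly(coeffs, x, n):
    """Evaluate a polynomial (coefficients listed low degree first) at x modulo prime n."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % n
    return acc

def juels_sudan_sketch(w, n, r, t, rng=random):
    """Sketch of a set w of s distinct elements of GF(n), n prime (an assumption).
    Returns the r pairs (x_i, y_i) sorted by x_i."""
    s = len(w)
    assert s < r <= n and 1 <= t <= s
    k = s - t - 1
    # Step 1: random polynomial of degree at most k, evaluated on the genuine points.
    coeffs = [rng.randrange(n) for _ in range(k + 1)]
    pairs = {x: eval_poly(coeffs, x, n) for x in w}
    # Step 2: r - s distinct chaff positions outside w.
    chaff_xs = rng.sample(sorted(set(range(n)) - set(w)), r - s)
    # Step 3: chaff values uniform over F minus {p(x)}.
    for x in chaff_xs:
        y = rng.randrange(n - 1)
        if y >= eval_poly(coeffs, x, n):
            y += 1  # skip the one forbidden value p(x)
        pairs[x] = y
    # Step 4: output in order of x.
    return sorted(pairs.items())
```

A call such as `juels_sudan_sketch({3, 17, 42, 58, 77}, n=101, r=20, t=2)` returns 20 pairs, of which only the five whose first coordinate lies in the input set sit on the hidden polynomial.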

The parameter $t$ measures the error-tolerance of the scheme: given $\mathsf{SS}(w)$ and a set $w'$ such that $|w\triangle w'|\leq t$, one can recover $w$ by considering the pairs $(x_i,y_i)$ for $x_i\in w'$ and running Reed-Solomon decoding to recover the low-degree polynomial $p(\cdot)$. When the parameter $r$ is very small, the scheme corrects approximately twice as many errors with good probability (in the “input-dependent” sense from Section 8). When $r$ is low, however, we show here that the bound on the entropy loss becomes very weak.

The parameter $r$ dictates the amount of storage necessary, on the one hand, and also the security of the scheme (that is, for $r=s$ the scheme leaks all information and for larger and larger $r$ there is less information about $w$). Juels and Sudan actually propose two analyses for the scheme. First, they analyze the case where the secret $w$ is distributed uniformly over all subsets of size $s$. Second, they provide an analysis of a nonuniform password distribution, but only for the case $r=n$ (that is, their analysis applies only in the small universe setting, where $\Omega(n)$ storage is acceptable). Here we give a simpler analysis which handles nonuniformity and any $r\leq n$. We get the same results for a broader set of parameters.

Lemma D.1.

The entropy loss of the Juels-Sudan scheme is at most $t\log n+\log\binom{n}{r}-\log\binom{n-s}{r-s}+2$.

Proof.

This is a simple application of Lemma 2.2(b). $\mathbf{H}_\infty((W,\mathsf{SS}(W)))$ can be computed as follows. Choosing the polynomial $p$ (which can be uniquely recovered from $w$ and $\mathsf{SS}(w)$) requires $s-t$ random choices from $\mathcal{F}$. The choice of the remaining $x_i$’s requires $\log\binom{n-s}{r-s}$ bits, and choosing the $y_i$’s requires $r-s$ random choices from $\mathcal{F}-\{p(x_i)\}$. Thus, $\mathbf{H}_\infty((W,\mathsf{SS}(W)))=\mathbf{H}_\infty(W)+(s-t)\log n+\log\binom{n-s}{r-s}+(r-s)\log(n-1)$. The output can be described in $\log\left(\binom{n}{r}n^{r}\right)$ bits. The result follows by Lemma 2.2(b) after observing that $(r-s)\log\frac{n}{n-1}<n\log\frac{n}{n-1}\leq 2$. ∎
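
The final simplification can be checked numerically. The sketch below (hypothetical function names, base-2 logarithms assumed) computes the bound obtained directly from the counting in the proof, namely the output length minus the randomness added by the sketch, and compares it with the simplified expression of Lemma D.1; the two differ by exactly $2-(r-s)\log\frac{n}{n-1}$, which is nonnegative.

```python
from math import comb, log2

def counted_loss_bound(n, r, s, t):
    """log(C(n,r) n^r) minus the entropy added by the sketch, as counted in the proof."""
    output_bits = log2(comb(n, r)) + r * log2(n)
    added_entropy = ((s - t) * log2(n)
                     + log2(comb(n - s, r - s))
                     + (r - s) * log2(n - 1))
    return output_bits - added_entropy

def lemma_d1_bound(n, r, s, t):
    """The simplified bound t log n + log C(n,r) - log C(n-s,r-s) + 2."""
    return t * log2(n) + log2(comb(n, r)) - log2(comb(n - s, r - s)) + 2

def slack(n, r, s):
    """Gap between the two bounds: 2 - (r - s) log(n / (n - 1))."""
    return 2 - (r - s) * log2(n / (n - 1))
```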

In the large universe setting, we will have $r\ll n$ (since we wish to have storage polynomial in $s$). In that setting, the bound on the entropy loss of the Juels-Sudan scheme is in fact very large. We can rewrite the entropy loss as $t\log n-\log\binom{r}{s}+\log\binom{n}{s}+2$, using the identity $\binom{n}{r}\binom{r}{s}=\binom{n}{s}\binom{n-s}{r-s}$. Now the entropy of $W$ is at most $\log\binom{n}{s}$, and so our lower bound on the remaining entropy is $\left(\log\binom{r}{s}-t\log n-2\right)$. To make this quantity large requires making $r$ very large.
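
To see how weak the guarantee becomes when $r\ll n$, the following sketch (hypothetical function names and parameter choices, base-2 logarithms assumed) checks the binomial identity with exact integer arithmetic and evaluates the lower bound $\log\binom{r}{s}-t\log n-2$ on the remaining entropy.

```python
from math import comb, log2

def binomial_identity_holds(n, r, s):
    """Exact check of C(n,r) C(r,s) == C(n,s) C(n-s,r-s)."""
    return comb(n, r) * comb(r, s) == comb(n, s) * comb(n - s, r - s)

def remaining_entropy_bound(n, r, s, t):
    """Lower bound log C(r,s) - t log n - 2 on the remaining entropy (in bits)."""
    return log2(comb(r, s)) - t * log2(n) - 2

# Hypothetical large-universe parameters: 32-bit universe, sets of size 20, t = 4.
n, s, t = 2 ** 32, 20, 4
for r in (100, 1000, 10 ** 5):
    print(r, remaining_entropy_bound(n, r, s, t))
```

With these illustrative parameters the bound is negative, hence vacuous, for $r=100$ and only becomes clearly positive once $r$ is several orders of magnitude larger than $s$.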

Appendix E BCH Syndrome Decoding in Sublinear Time

We show that the standard decoding algorithm for BCH codes can be modified to run in time polynomial in the length of the syndrome. This works for BCH codes over any field $\mathit{GF}(q)$, which include Hamming codes in the binary case and Reed-Solomon for the case $n=q-1$. BCH codes are handled in detail in many textbooks (e.g., [vL92]); our presentation here is quite terse. For simplicity, we discuss only primitive, narrow-sense BCH codes here; the discussion extends easily to the general case.

The algorithm discussed here has been revised due to an error pointed out by Ari Trachtenberg. Its implementation is available [HJR06].

We’ll use a slightly nonstandard formulation of BCH codes. Let $n=q^m-1$ (in the binary case of interest in Section 6.3, $q=2$). We will work in two finite fields: $\mathit{GF}(q)$ and a larger extension field $\mathcal{F}=\mathit{GF}(q^m)$. BCH codewords, formally defined below, are then vectors in $\mathit{GF}(q)^n$. In most common presentations, one indexes the $n$ positions of these vectors by discrete logarithms of the elements of $\mathcal{F}^*$: position $i$, for $1\leq i\leq n$, corresponds to $\alpha^i$, where $\alpha$ generates the multiplicative group $\mathcal{F}^*$. However, there is no inherent reason to do so: they can be indexed by elements of $\mathcal{F}$ directly rather than by their discrete logarithms. Thus, we say that a word has value $p_x$ at position $x$, where $x\in\mathcal{F}^*$. If one ever needs to write down the entire $n$-character word in an ordered fashion, one can arbitrarily choose a convenient ordering of the elements of $\mathcal{F}$ (e.g., by using some standard binary representation of field elements); for our purposes this is not necessary, as we do not store entire $n$-bit words explicitly, but rather represent them by their supports: $\mathsf{supp}(v)=\{(x,p_x)\mid p_x\neq 0\}$. Note that for the binary case of interest in Section 6.3, we can define $\mathsf{supp}(v)=\{x\mid p_x\neq 0\}$, because $p_x$ can take only two values: 0 or 1.

Our choice of representation will be crucial for efficient decoding: in the more common representation, the last step of the decoding algorithm requires one to find the position $i$ of the error from the field element $\alpha^i$. However, no efficient algorithms for computing the discrete logarithm are known if $q^m$ is large (indeed, a lot of cryptography is based on the assumption that such an efficient algorithm does not exist). In our representation, the field element $\alpha^i$ will in fact be the position of the error.

Definition 8.

The (narrow-sense, primitive) BCH code of designed distance $\delta$ over $\mathit{GF}(q)$ (of length $n\geq\delta$) is given by the set of vectors of the form $\big(c_x\big)_{x\in\mathcal{F}^*}$ such that each $c_x$ is in the smaller field $\mathit{GF}(q)$, and the vector satisfies the constraints $\sum_{x\in\mathcal{F}^*}c_x x^i=0$, for $i=1,\ldots,\delta-1$, with arithmetic done in the larger field $\mathcal{F}$.

To explain this definition, let us fix a generator $\alpha$ of the multiplicative group of the large field $\mathcal{F}^*$. For any vector of coefficients $\big(c_x\big)_{x\in\mathcal{F}^*}$, we can define a polynomial

$$c(z)=\sum_{x\in\mathit{GF}(q^m)^*}c_x z^{\mathsf{dlog}(x)}\,,$$

where $\mathsf{dlog}(x)$ is the discrete logarithm of $x$ with respect to $\alpha$. The conditions of the definition are then equivalent to the requirement (more commonly seen in presentations of BCH codes) that $c(\alpha^i)=0$ for $i=1,\ldots,\delta-1$, because $(\alpha^i)^{\mathsf{dlog}(x)}=(\alpha^{\mathsf{dlog}(x)})^i=x^i$.

We can simplify this somewhat. Because the coefficients $c_x$ are in $\mathit{GF}(q)$, they satisfy $c_x^q=c_x$. Using the identity $(x+y)^q=x^q+y^q$, which holds even in the large field $\mathcal{F}$, we have $c(\alpha^i)^q=\sum_{x\neq 0}c_x^q x^{iq}=c(\alpha^{iq})$. Thus, roughly a $1/q$ fraction of the conditions in the definition are redundant: we need only to check that they hold for $i\in\{1,\dots,\delta-1\}$ such that $q\nmid i$.

The syndrome of a word (not necessarily a codeword) $(p_x)_{x\in\mathcal{F}^*}\in\mathit{GF}(q)^n$ with respect to the BCH code above is the vector

$$\mathsf{syn}(p)=p(\alpha^1),\ldots,p(\alpha^{\delta-1}),\quad\text{where}\quad p(\alpha^i)=\sum_{x\in\mathcal{F}^*}p_x x^i.$$

As mentioned above, we do not in fact have to include the values $p(\alpha^i)$ such that $q\mid i$.
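
The following sketch illustrates the syndrome computation from a support list in the binary case, and the redundancy of the entries with $q\mid i$. It makes several assumptions that are not in the text: the field is $\mathit{GF}(2^4)$ represented as integers 0..15 with reduction polynomial $x^4+x+1$, the function names are hypothetical, and exponentiation is done naively rather than by repeated squaring.

```python
def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) = GF(2)[x]/(x^4 + x + 1), encoded as ints 0..15."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in characteristic 2 is XOR
        b >>= 1
        a <<= 1
        if a & 0b10000:          # degree reached 4: reduce by x^4 + x + 1
            a ^= 0b10011
    return result

def gf16_pow(a, e):
    """Compute a^e in GF(2^4) by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf16_mul(result, a)
    return result

def syndrome_from_support(support, delta):
    """Binary case: support is the set of nonzero field elements x with p_x = 1.
    Returns [p(alpha^1), ..., p(alpha^(delta-1))], where p(alpha^i) = sum of x^i."""
    syn = []
    for i in range(1, delta):
        acc = 0
        for x in support:
            acc ^= gf16_pow(x, i)
        syn.append(acc)
    return syn
```

For instance, with `syn = syndrome_from_support([3, 7, 12], 7)`, the even-indexed entries satisfy `syn[1] == gf16_mul(syn[0], syn[0])`, so only the odd-indexed values need to be stored. Note that the computation never refers to the generator $\alpha$ or to discrete logarithms: positions are field elements.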

Computing with Low-Weight Words.  A low-weight word $p\in\mathit{GF}(q)^n$ can be represented either as a long string or, more compactly, as a list of positions where it is nonzero and its values at those points. We call this representation the support list of $p$ and denote it $\mathsf{supp}(p)=\{(x,p_x)\}_{x:p_x\neq 0}$.

Lemma E.1.

For a $q$-ary BCH code $C$ of designed distance $\delta$, one can compute:

  1. $\mathsf{syn}(p)$ from $\mathsf{supp}(p)$ in time polynomial in $\delta$, $\log n$, and $|\mathsf{supp}(p)|$, and

  2. $\mathsf{supp}(p)$ from $\mathsf{syn}(p)$ (when $p$ has weight at most $(\delta-1)/2$), in time polynomial in $\delta$ and $\log n$.

Proof.

Recall that 𝗌𝗒𝗇(p)=(p(α),,p(αδ1))𝗌𝗒𝗇𝑝𝑝𝛼𝑝superscript𝛼𝛿1{\mathsf{syn}}(p)=(p(\alpha),\dots,p(\alpha^{\delta-1}))sansserif_syn ( italic_p ) = ( italic_p ( italic_α ) , … , italic_p ( italic_α start_POSTSUPERSCRIPT italic_δ - 1 end_POSTSUPERSCRIPT ) ) where p(αi)=x0pxxi𝑝superscript𝛼𝑖subscript𝑥0subscript𝑝𝑥superscript𝑥𝑖p(\alpha^{i})=\sum_{x\neq 0}p_{x}x^{i}italic_p ( italic_α start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_x ≠ 0 end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. Part (1) is easy, since to compute the syndrome we need only to compute the powers of x𝑥xitalic_x. This requires about δ𝗐𝖾𝗂𝗀𝗁𝗍(p)𝛿𝗐𝖾𝗂𝗀𝗁𝗍𝑝\delta\cdot{\mathsf{weight}}(p)italic_δ ⋅ sansserif_weight ( italic_p ) multiplications in {\cal F}caligraphic_F. For Part (2), we adapt Berlekamp’s BCH decoding algorithm, based on its presentation in [vL92]. Let M={x*|px0}𝑀conditional-set𝑥superscriptsubscript𝑝𝑥0M=\left\{{x\in{\cal F}^{*}|p_{x}\neq 0}\right\}italic_M = { italic_x ∈ caligraphic_F start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT | italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ≠ 0 }, and define

\[
\sigma(z)\stackrel{\rm def}{=}\prod_{x\in M}(1-xz)\quad\mbox{and}\quad\omega(z)\stackrel{\rm def}{=}\sigma(z)\sum_{x\in M}\frac{p_x xz}{(1-xz)}\,.
\]

Since $(1-xz)$ divides $\sigma(z)$ for $x\in M$, we see that $\omega(z)$ is in fact a polynomial of degree at most $|M|=\mathsf{weight}(p)\leq(\delta-1)/2$. The polynomials $\sigma(z)$ and $\omega(z)$ are known as the error locator polynomial and evaluator polynomial, respectively; observe that $\gcd(\sigma(z),\omega(z))=1$.

We will in fact work with our polynomials modulo $z^\delta$. In this arithmetic the inverse of $(1-xz)$ is $\sum_{\ell=1}^{\delta}(xz)^{\ell-1}$; that is,

\[
(1-xz)\sum_{\ell=1}^{\delta}(xz)^{\ell-1}\equiv 1\mod z^\delta.
\]

We are given $p(\alpha^\ell)$ for $\ell=1,\dots,\delta-1$. Let $S(z)=\sum_{\ell=1}^{\delta-1}p(\alpha^\ell)z^\ell$. Note that $S(z)\equiv\sum_{x\in M}p_x\frac{xz}{(1-xz)}\mod z^\delta$. This implies that

\[
S(z)\sigma(z)\equiv\omega(z)\mod z^\delta.
\]
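The step from the definition of $S(z)$ to its closed form is left implicit; one way to spell it out, using only the definition of $p(\alpha^\ell)$ and the inverse of $(1-xz)$ modulo $z^\delta$ given above, is the following worked derivation (added here as an illustration):

```latex
\begin{align*}
S(z) &= \sum_{\ell=1}^{\delta-1} p(\alpha^\ell)\, z^\ell
      = \sum_{\ell=1}^{\delta-1} \sum_{x\in M} p_x x^\ell z^\ell
      = \sum_{x\in M} p_x \sum_{\ell=1}^{\delta-1} (xz)^\ell \\
     &\equiv \sum_{x\in M} p_x\, xz \sum_{\ell=1}^{\delta} (xz)^{\ell-1}
      \equiv \sum_{x\in M} \frac{p_x\, xz}{1-xz} \pmod{z^\delta}.
\end{align*}
```

The first congruence holds because the two sums differ only by the term $(xz)^\delta$, which vanishes modulo $z^\delta$. Multiplying both sides by $\sigma(z)$ and comparing with the definition of $\omega(z)$ gives the congruence relating $S$, $\sigma$, and $\omega$.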

The polynomials $\sigma(z)$ and $\omega(z)$ satisfy the following four conditions: they are of degree at most $(\delta-1)/2$ each, they are relatively prime, the constant coefficient of $\sigma$ is 1, and they satisfy this congruence. In fact, let $\omega'(z),\sigma'(z)$ be any nonzero solution to this congruence, where the degrees of $\omega'(z)$ and $\sigma'(z)$ are at most $(\delta-1)/2$. Then $\omega'(z)/\sigma'(z)=\omega(z)/\sigma(z)$. (To see why this is so, multiply the initial congruence by $\sigma'(z)$ to get $\omega(z)\sigma'(z)\equiv\sigma(z)\omega'(z)\mod z^\delta$. Since both sides of the congruence have degree at most $\delta-1$, they are in fact equal as polynomials.)
Thus, there is at most one solution $\sigma(z),\omega(z)$ satisfying all four conditions, which can be obtained from any $\sigma'(z),\omega'(z)$ by reducing the resulting fraction $\omega'(z)/\sigma'(z)$ to obtain the solution of minimal degree with the constant term of $\sigma$ equal to 1.

Finally, the roots of $\sigma(z)$ are the points $x^{-1}$ for $x\in M$, and the exact value of $p_x$ can be recovered from $\omega(x^{-1})=p_x\prod_{y\in M,\,y\neq x}(1-yx^{-1})$ (this is needed only for $q>2$, because for $q=2$, $p_x=1$). Note that it is possible that a solution to the congruence will be found even if the input syndrome is not a syndrome of any $p$ with $\mathsf{weight}(p)\leq(\delta-1)/2$ (it is also possible that a solution to the congruence will not be found at all, or that the resulting $\sigma(z)$ will not split into distinct nonzero roots). Such a solution will not give the correct $p$. Thus, if there is no guarantee that $\mathsf{weight}(p)$ is actually at most $(\delta-1)/2$, it is necessary to recompute $\mathsf{syn}(p)$ after finding the solution, in order to verify that $p$ is indeed correct.
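The identity for $\omega(x^{-1})$ used above can be checked by clearing denominators in the definition of $\omega(z)$. The following worked step is added for illustration and is not part of the original proof:

```latex
\begin{align*}
\omega(z) &= \sigma(z)\sum_{x\in M}\frac{p_x\, xz}{1-xz}
           = \sum_{x\in M} p_x\, xz \prod_{y\in M,\, y\neq x} (1-yz), \\
\omega(x^{-1}) &= p_x \cdot x x^{-1} \prod_{y\in M,\, y\neq x} (1-yx^{-1})
           = p_x \prod_{y\in M,\, y\neq x} (1-yx^{-1}).
\end{align*}
```

In the second line, every summand indexed by some $x'\neq x$ contains the factor $(1-xz)$, which vanishes at $z=x^{-1}$, so only the summand for $x$ survives. Because the elements of $M$ are distinct, the remaining product is nonzero, and $p_x$ is obtained by dividing $\omega(x^{-1})$ by it.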

Representing coefficients of $\sigma'(z)$ and $\omega'(z)$ as unknowns, we see that solving the congruence requires only solving a system of $\delta$ linear equations (one for each degree of $z$, from 0 to $\delta-1$) involving $\delta+1$ variables over $\mathcal{F}$, which can be done in $O(\delta^3)$ operations in $\mathcal{F}$ using, e.g., Gaussian elimination. The reduction of the fraction $\omega'(z)/\sigma'(z)$ requires simply running Euclid’s algorithm for finding the g.c.d. of two polynomials of degree less than $\delta$, which takes $O(\delta^2)$ operations in $\mathcal{F}$. Suppose the resulting $\sigma$ has degree $e$. Then one can find the roots of $\sigma$ as follows. First test that $\sigma$ indeed has $e$ distinct roots by testing that $\sigma(z)\mid z^{q^m}-z$ (this is a necessary and sufficient condition, because every element of $\mathcal{F}$ is a root of $z^{q^m}-z$ exactly once).
This can be done by computing $(z^{q^m}\bmod\sigma(z))$ and testing if it equals $z\bmod\sigma$; it takes $m$ exponentiations of a polynomial to the power $q$, i.e., $O((m\log q)e^2)$ operations in $\mathcal{F}$. Then apply an equal-degree-factorization algorithm (e.g., as described in [Sho05]), which also takes $O((m\log q)e^2)$ operations in $\mathcal{F}$. Finally, after taking inverses of the roots of $\sigma$ and finding $p_x$ (which takes $O(e^2)$ operations in $\mathcal{F}$), recompute $\mathsf{syn}(p)$ to verify that it is equal to the input value.
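The distinct-roots test just described can be sketched in Python as follows. The concrete field $\mathit{GF}(2^4)$ (so $q=2$, $m=4$, built from $x^4+x+1$), the coefficient-list representation of polynomials, and the function names are assumptions for illustration; the sketch uses schoolbook polynomial arithmetic and makes no attempt at the faster methods cited in the text.

```python
# Illustrative sketch only. Assumed setting: GF(2^4) via x^4 + x + 1, so q = 2, m = 4.
EXP, LOG = [0] * 30, [0] * 16
_v = 1
for _i in range(15):
    EXP[_i], LOG[_v] = _v, _i
    _v <<= 1
    if _v & 0x10:
        _v ^= 0b10011
for _i in range(15, 30):
    EXP[_i] = EXP[_i - 15]


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[(15 - LOG[a]) % 15]           # a must be nonzero


def poly_mod(a, mod):
    """Remainder of a modulo `mod`; coefficient lists, low degree first."""
    a, d = list(a), len(mod) - 1
    lead_inv = inv(mod[-1])                  # `mod` needs a nonzero leading coefficient
    for k in range(len(a) - 1, d - 1, -1):
        c = mul(a[k], lead_inv)
        for j in range(d + 1):
            a[k - d + j] ^= mul(c, mod[j])
    return (a + [0] * d)[:d]                 # exactly d coefficients


def poly_mulmod(a, b, mod):
    """Schoolbook product of a and b, reduced modulo `mod`."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] ^= mul(ai, bj)
    return poly_mod(prod, mod)


def has_distinct_roots_in_field(sigma, m=4):
    """Test whether sigma divides z^(2^m) - z, using m squarings modulo sigma."""
    z = poly_mod([0, 1], sigma)
    t = z
    for _ in range(m):
        t = poly_mulmod(t, t, sigma)
    return t == z


print(has_distinct_roots_in_field([1, 10, 3]))   # (1 - 2z)(1 - 8z): prints True
print(has_distinct_roots_in_field([1, 0, 4]))    # (1 - 2z)^2:       prints False
```

Only $m$ squarings of a polynomial of degree below $e$ are needed, which is where the $(m\log q)e^2$ bound with $q=2$ comes from in this small setting.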

Because $m\log q=\log(n+1)$ and $e\leq(\delta-1)/2$, the total running time is $O(\delta^3+\delta^2\log n)$ operations in $\mathcal{F}$; each operation in $\mathcal{F}$ can be done in time $O(\log^2 n)$, or faster using advanced techniques.

One can improve this running time substantially. The error locator polynomial $\sigma(z)$ can be found in $O(\log\delta)$ convolutions (multiplications) of polynomials over $\mathcal{F}$ of degree $(\delta-1)/2$ each [Bla83, Section 11.7] by exploiting the special structure of the system of linear equations being solved. Each convolution can be performed asymptotically in time $O(\delta\log\delta\log\log\delta)$ (see, e.g., [vzGG03]), and the total time required to find $\sigma$ gets reduced to $O(\delta\log^2\delta\log\log\delta)$ operations in $\mathcal{F}$. This replaces the $\delta^3$ term in the above running time.

While this is asymptotically very good, Euclidean-algorithm-based decoding [SKHN75], which runs in $O(\delta^2)$ operations in $\mathcal{F}$, will find $\sigma(z)$ faster for reasonable values of $\delta$ (certainly for $\delta<1000$). The algorithm finds $\sigma$ as follows:

set $R_{\mathrm{old}}(z)\leftarrow z^{\delta-1}$, $R_{\mathrm{cur}}(z)\leftarrow S(z)/z$, $V_{\mathrm{old}}(z)\leftarrow 0$, $V_{\mathrm{cur}}(z)\leftarrow 1$.
while $\deg(R_{\mathrm{cur}}(z))\geq(\delta-1)/2$:
    divide $R_{\mathrm{old}}(z)$ by $R_{\mathrm{cur}}(z)$ to get quotient $q(z)$ and remainder $R_{\mathrm{new}}(z)$;
    set $V_{\mathrm{new}}(z)\leftarrow V_{\mathrm{old}}(z)-q(z)V_{\mathrm{cur}}(z)$;
    set $R_{\mathrm{old}}(z)\leftarrow R_{\mathrm{cur}}(z)$, $R_{\mathrm{cur}}(z)\leftarrow R_{\mathrm{new}}(z)$, $V_{\mathrm{old}}(z)\leftarrow V_{\mathrm{cur}}(z)$, $V_{\mathrm{cur}}(z)\leftarrow V_{\mathrm{new}}(z)$.
set $c\leftarrow V_{\mathrm{cur}}(0)$; set $\sigma(z)\leftarrow V_{\mathrm{cur}}(z)/c$ and $\omega(z)\leftarrow z\cdot R_{\mathrm{cur}}(z)/c$.

In the above algorithm, if $c=0$, then the correct $\sigma(z)$ does not exist, i.e., $\mathsf{weight}(p)>(\delta-1)/2$. The correctness of this algorithm can be seen by observing that the congruence $S(z)\sigma(z)\equiv\omega(z)\pmod{z^\delta}$ can have $z$ factored out of it (because $S(z)$, $\omega(z)$, and $z^\delta$ are all divisible by $z$) and rewritten as $(S(z)/z)\sigma(z)+u(z)z^{\delta-1}=\omega(z)/z$, for some $u(z)$.
The obtained $\sigma$ is easily shown to be the correct one (if one exists at all) by applying [Sho05, Theorem 18.7] (to use the notation of that theorem, set $n=z^{\delta-1}$, $y=S(z)/z$, $t^*=r^*=(\delta-1)/2$, $r'=\omega(z)/z$, $s'=u(z)$, $t'=\sigma(z)$).
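The Euclidean-algorithm-based procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the field is fixed to $\mathit{GF}(2^4)$ (built from $x^4+x+1$, with $\alpha$ encoded as 2), polynomials are coefficient lists with the low degree first, and all function names are hypothetical.

```python
# Illustrative sketch only. Assumed setting: GF(2^4) via x^4 + x + 1, alpha = 2.
EXP, LOG = [0] * 30, [0] * 16
_v = 1
for _i in range(15):
    EXP[_i], LOG[_v] = _v, _i
    _v <<= 1
    if _v & 0x10:
        _v ^= 0b10011
for _i in range(15, 30):
    EXP[_i] = EXP[_i - 15]


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[(15 - LOG[a]) % 15]           # a must be nonzero


def trim(a):
    """Drop trailing zero coefficients; the zero polynomial becomes []."""
    a = list(a)
    while a and a[-1] == 0:
        a.pop()
    return a


def poly_add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([u ^ v for u, v in zip(a, b)])


def poly_mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= mul(ai, bj)
    return trim(out)


def poly_divmod(a, b):
    """Quotient and remainder of a divided by a nonzero polynomial b."""
    a, b = trim(a), trim(b)
    shift = len(b) - 1
    quo = [0] * max(len(a) - shift, 0)
    lead_inv = inv(b[-1])
    for k in range(len(a) - 1, shift - 1, -1):
        c = mul(a[k], lead_inv)
        quo[k - shift] = c
        for j, bj in enumerate(b):
            a[k - shift + j] ^= mul(c, bj)
    return trim(quo), trim(a)


def find_sigma_omega(syndrome):
    """syndrome = [p(alpha^1), ..., p(alpha^(delta-1))]; returns (sigma, omega) or None."""
    delta = len(syndrome) + 1
    r_old, r_cur = [0] * (delta - 1) + [1], trim(syndrome)   # z^(delta-1) and S(z)/z
    v_old, v_cur = [], [1]
    while 2 * (len(r_cur) - 1) >= delta - 1:                 # deg(R_cur) >= (delta-1)/2
        q, r_new = poly_divmod(r_old, r_cur)
        v_new = poly_add(v_old, poly_mul(q, v_cur))          # minus equals plus in char. 2
        r_old, r_cur, v_old, v_cur = r_cur, r_new, v_cur, v_new
    c = v_cur[0] if v_cur else 0
    if c == 0:
        return None                                          # no valid sigma exists
    c_inv = inv(c)
    sigma = [mul(c_inv, t) for t in v_cur]
    omega = trim([0] + [mul(c_inv, t) for t in r_cur])       # omega = z * R_cur / c
    return sigma, omega


# Syndrome of the binary word with ones at x = alpha and x = alpha^3 (delta = 5).
print(find_sigma_omega([10, 8, 2, 12]))      # prints ([1, 10, 3], [0, 10])
```

In this example the output $\sigma(z)=1+10z+3z^2$ equals $(1-2z)(1-8z)$ in the chosen encoding, so its roots are the inverses of the two error positions. As noted in the text, when there is no guarantee on the weight of $p$ the result should still be verified by recomputing the syndrome.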

The root finding of $\sigma$ can also be sped up. Asymptotically, detecting if a polynomial over $\mathcal{F}=\mathit{GF}(q^m)=\mathit{GF}(n+1)$ of degree $e$ has $e$ distinct roots and finding these roots can be performed in time $O(e^{1.815}(\log n)^{0.407})$ operations in $\mathcal{F}$ using the algorithm of Kaltofen and Shoup [KS95], or in time $O(e^2+(\log n)e\log e\log\log e)$ operations in $\mathcal{F}$ using the EDF algorithm of Cantor and Zassenhaus. (For the latter, see [Sho05, Section 21.3], and substitute the most efficient known polynomial arithmetic. For example, the procedures described in [vzGG03] take time $O(e\log e\log\log e)$ instead of time $O(e^2)$ to perform modular arithmetic operations with degree-$e$ polynomials.) For reasonable values of $e$, the Cantor-Zassenhaus EDF algorithm with Karatsuba’s multiplication algorithm [KO63] for polynomials will be faster, giving root-finding running time of $O(e^2+e^{\log_2 3}\log n)$ operations in $\mathcal{F}$.
Note that if the actual weight $e$ of $p$ is close to the maximum tolerated $(\delta-1)/2$, then finding the roots of $\sigma$ will actually take longer than finding $\sigma$. ∎

A Dual View of the Algorithm.  Readers may be used to seeing a different, evaluation-based formulation of BCH codes, in which codewords are generated as follows. Let $\mathcal{F}$ again be an extension of $\mathit{GF}(q)$, and let $n$ be the length of the code (note that $|\mathcal{F}^*|$ is not necessarily equal to $n$ in this formulation). Fix distinct $x_1,x_2,\dots,x_n\in\mathcal{F}$. For every polynomial $c$ over the large field $\mathcal{F}$ of degree at most $n-\delta$, the vector $(c(x_1),c(x_2),\dots,c(x_n))$ is a codeword if and only if every coordinate of the vector happens to be in the smaller field: $c(x_i)\in\mathit{GF}(q)$ for all $i$. In particular, when $\mathcal{F}=\mathit{GF}(q)$, every polynomial leads to a codeword, thus giving Reed-Solomon codes.

The syndrome in this formulation can be computed as follows: given a vector $y=(y_1,y_2,\dots,y_n)$, find the interpolating polynomial $P=p_{n-1}x^{n-1}+p_{n-2}x^{n-2}+\dots+p_0$ over $\mathcal{F}$ of degree at most $n-1$ such that $P(x_i)=y_i$ for all $i$. The syndrome is then the negative top $\delta-1$ coefficients of $P$: $\mathsf{syn}(y)=(-p_{n-1},-p_{n-2},\dots,-p_{n-(\delta-1)})$. (It is easy to see that this is a syndrome: it is a linear function that is zero exactly on the codewords.)

When $n=|\mathcal{F}|-1$, we can index the $n$-component vectors by elements of $\mathcal{F}^*$, writing codewords as $(c(x))_{x\in\mathcal{F}^*}$. In this case, the syndrome of $(y_x)_{x\in\mathcal{F}^*}$, defined as the negative top $\delta-1$ coefficients of the polynomial $P$ such that $P(x)=y_x$ for all $x\in\mathcal{F}^*$, is equal to the syndrome defined following Definition 8 as $\sum_{x\in\mathcal{F}^*}y_x x^i$ for $i=1,2,\dots,\delta-1$.
(This statement can be shown as follows: because both maps are linear, it is sufficient to prove that they agree on a vector $(y_x)_{x\in\mathcal{F}^*}$ such that $y_a=1$ for some $a\in\mathcal{F}^*$ and $y_x=0$ for $x\neq a$. For such a vector, $\sum_{x\in\mathcal{F}^*}y_x x^i=a^i$.
On the other hand, the interpolating polynomial $P(x)$ such that $P(x)=y_x$ is $-ax^{n-1}-a^2x^{n-2}-\dots-a^{n-1}x-1$. Indeed, $P(a)=-n=1$; furthermore, multiplying $P(x)$ by $x-a$ gives $-a(x^n-1)$, which is zero on all of $\mathcal{F}^*$; hence $P(x)$ is zero for every $x\neq a$.) Thus, when $n=|\mathcal{F}|-1$, the codewords obtained via the evaluation-based definition are identical to the codewords obtained via Definition 8, because codewords are simply elements with the zero syndrome, and the syndrome maps agree.
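The agreement of the two syndrome maps can be checked numerically in a small case. The Python sketch below assumes the Reed-Solomon setting $\mathcal{F}=\mathit{GF}(7)$ (so $n=6$ and field arithmetic is arithmetic modulo 7); the field choice, the Lagrange-interpolation helper, and the function names are illustrative assumptions rather than anything prescribed by the paper.

```python
# Illustrative sketch only. Assumed setting: F = GF(7), so n = |F| - 1 = 6.
P = 7
N = P - 1


def interpolate(ys):
    """Coefficients (low degree first) of the polynomial of degree <= N - 1
    taking the value ys[x - 1] at each nonzero field element x = 1, ..., N."""
    coeffs = [0] * N
    for xi in range(1, N + 1):
        basis, denom = [1], 1
        for xj in range(1, N + 1):
            if xj == xi:
                continue
            # Multiply the running product by (z - xj).
            basis = [((basis[k - 1] if k > 0 else 0)
                      - xj * (basis[k] if k < len(basis) else 0)) % P
                     for k in range(len(basis) + 1)]
            denom = denom * (xi - xj) % P
        scale = ys[xi - 1] * pow(denom, P - 2, P) % P   # Fermat inverse of denom
        coeffs = [(c + scale * b) % P for c, b in zip(coeffs, basis)]
    return coeffs


def syn_from_coefficients(ys, delta):
    """Negative top delta - 1 coefficients of the interpolating polynomial."""
    p = interpolate(ys)
    return [(-p[N - i]) % P for i in range(1, delta)]


def syn_from_evaluations(ys, delta):
    """Sum over nonzero x of y_x * x^i, for i = 1, ..., delta - 1."""
    return [sum(ys[x - 1] * pow(x, i, P) for x in range(1, N + 1)) % P
            for i in range(1, delta)]


y = [3, 1, 4, 1, 5, 2]
print(syn_from_coefficients(y, 4) == syn_from_evaluations(y, 4))   # prints True

# Evaluations of a polynomial of degree <= n - delta form a codeword: zero syndrome.
codeword = [(1 + 2 * x + 3 * x * x) % P for x in range(1, N + 1)]
print(syn_from_coefficients(codeword, 4))                          # prints [0, 0, 0]
```

The second check mirrors the remark that codewords are exactly the vectors with zero syndrome under either definition.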

This is an example of a remarkable duality between evaluations of polynomials and their coefficients: the syndrome can be viewed either as the evaluation of a polynomial whose coefficients are given by the vector, or as the coefficients of the polynomial whose evaluations are given by a vector.

The syndrome decoding algorithm above has a natural interpretation in the evaluation-based view. Our presentation is an adaptation of Welch-Berlekamp decoding as presented in, e.g., [Sud01, Chapter 10].

Suppose $n=|\mathcal{F}|-1$ and $x_1,\dots,x_n$ are the nonzero elements of the field. Let $y=(y_1,y_2,\dots,y_n)$ be a vector. We are given its syndrome $\mathsf{syn}(y)=(-p_{n-1},-p_{n-2},\dots,-p_{n-(\delta-1)})$, where $p_{n-1},\dots,p_{n-(\delta-1)}$ are the top coefficients of the interpolating polynomial $P$. Knowing only $\mathsf{syn}(y)$, we need to find at most $(\delta-1)/2$ locations $x_i$ such that correcting all the corresponding $y_i$ will result in a codeword. Suppose that codeword is given by a degree-$(n-\delta)$ polynomial $c$. Note that $c$ agrees with $P$ on all but the error locations. Let $\rho(z)$ be the polynomial of degree at most $(\delta-1)/2$ whose roots are exactly the error locations.
(Note that $\sigma(z)$ from the decoding algorithm above is the same $\rho(z)$ but with coefficients in reverse order, because the roots of $\sigma$ are the inverses of the roots of $\rho$.) Then $\rho(z)\cdot P(z)=\rho(z)\cdot c(z)$ for $z=x_1,x_2,\dots,x_n$. Since $x_1,\dots,x_n$ are all the nonzero field elements, $\prod_{i=1}^{n}(z-x_i)=z^n-1$. Thus,

\[
\rho(z)\cdot c(z)\quad=\quad\rho(z)\cdot P(z)\bmod\prod_{i=1}^{n}(z-x_i)\quad=\quad\rho(z)\cdot P(z)\bmod(z^n-1)\,.
\]

If we write the left-hand side as $\alpha_{n-1}z^{n-1}+\alpha_{n-2}z^{n-2}+\cdots+\alpha_0$, then the above equation implies that $\alpha_{n-1}=\cdots=\alpha_{n-(\delta-1)/2}=0$ (because the degree of $\rho(z)\cdot c(z)$ is at most $n-(\delta+1)/2$). Because $\alpha_{n-1},\dots,\alpha_{n-(\delta-1)/2}$ depend on the coefficients of $\rho$ as well as on $p_{n-1},\dots,p_{n-(\delta-1)}$, but not on lower coefficients of $P$, we obtain a system of $(\delta-1)/2$ equations for $(\delta-1)/2$ unknown coefficients of $\rho$. A careful examination shows that it is essentially the same system as we had for $\sigma(z)$ in the algorithm above. The lowest-degree solution to this system is indeed the correct $\rho$, by the same argument that was used to prove the correctness of $\sigma$ in Lemma E.1. The roots of $\rho$ are the error locations.
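To make the dependence on the syndrome explicit, the system can be written out as follows. This is a sketch under notation introduced here for illustration only: $t=(\delta-1)/2$ and $\rho(z)=\sum_{j=0}^{t}\rho_j z^j$. For $n-t\le k\le n-1$, reduction modulo $z^n-1$ does not affect the coefficient of $z^k$ in $\rho(z)\cdot P(z)$ (since $n\ge\delta$), so

$$\alpha_k\;=\;\sum_{j=0}^{t}\rho_j\,p_{k-j}\;=\;0\,,\qquad k=n-1,\,n-2,\,\dots,\,n-t\,.$$

The indices $k-j$ range over $n-(\delta-1),\dots,n-1$ only, so every $p_{k-j}$ that appears is, up to sign, a coordinate of $\mathsf{syn}(y)$. The system is homogeneous, so $\rho$ is determined up to a scalar factor; fixing its leading coefficient leaves at most $t$ unknowns.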
For $q>2$, the actual corrections that are needed at the error locations (in other words, the light vector corresponding to the given syndrome) can then be recovered by solving the linear system of equations implied by the value of the syndrome.
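The two steps just described (solve for $\rho$ from the top coefficients, then solve for the corrections) can be illustrated with a small Python sketch. It is not the paper's algorithm verbatim: the field $GF(11)$, the parameters $n=10$ and $\delta=5$, and all function names are assumptions chosen for the example. It also relies on the identity $-p_{n-j}=\sum_i y_i x_i^{\,j}$, which holds because the $x_i$ range over all nonzero field elements, so the syndrome is computed as power sums instead of by explicit interpolation.

```python
Q = 11                  # prime field size (illustrative assumption)
N = Q - 1               # n = |F| - 1
DELTA = 5               # minimum distance; up to T errors are handled
T = (DELTA - 1) // 2
XS = list(range(1, Q))  # x_1..x_n: all nonzero field elements

def poly_eval(coeffs, x):
    """Evaluate a polynomial (coefficients low to high) at x over GF(Q)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % Q
    return acc

def syndrome(y):
    """Return [s_1..s_{DELTA-1}] with s_j = sum_i y_i x_i^j = -p_{n-j}."""
    return [sum(yi * pow(x, j, Q) for x, yi in zip(XS, y)) % Q
            for j in range(1, DELTA)]

def solve_mod(A, b):
    """Gauss-Jordan over GF(Q): one solution of A x = b, or None."""
    rows, cols = len(A), len(A[0])
    M = [[v % Q for v in A[r]] + [b[r] % Q] for r in range(rows)]
    pivots, r = [], 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], -1, Q)          # modular inverse (Python 3.8+)
        M[r] = [(v * inv) % Q for v in M[r]]
        for i in range(rows):
            if i != r and M[i][c]:
                f = M[i][c]
                M[i] = [(vi - f * vr) % Q for vi, vr in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    if any(M[i][cols] for i in range(r, rows)):
        return None                         # inconsistent system
    x = [0] * cols
    for i, c in enumerate(pivots):
        x[c] = M[i][cols]
    return x

def error_locator(s):
    """Lowest-degree monic rho with alpha_{n-1} = ... = alpha_{n-T} = 0."""
    # Row r is the equation for alpha_{n-1-r}; entry j is p_{n-1-r-j} = -s_{1+r+j}.
    M = [[(-s[r + j]) % Q for j in range(T + 1)] for r in range(T)]
    for e in range(T + 1):                  # try degree e = 0, 1, ..., T
        A = [row[:e] for row in M]
        b = [(-row[e]) % Q for row in M]
        low = solve_mod(A, b)
        if low is not None:
            return low + [1]                # coefficients low to high, monic

def error_values(s, locs):
    """Solve sum_a v_a * a^j = s_j (j = 1..DELTA-1) for the corrections v_a."""
    A = [[pow(a, j, Q) for a in locs] for j in range(1, DELTA)]
    return solve_mod(A, s)

def decode(y):
    """Correct up to T errors in y using only its syndrome."""
    s = syndrome(y)
    rho = error_locator(s)
    locs = [x for x in XS if poly_eval(rho, x) == 0]
    vals = error_values(s, locs)
    if vals is None:
        raise ValueError("more than T errors")
    fixed = list(y)
    for a, v in zip(locs, vals):
        fixed[XS.index(a)] = (fixed[XS.index(a)] - v) % Q
    return fixed

# Usage: corrupt two positions of a codeword (degree <= n - DELTA) and repair it.
codeword = [poly_eval([3, 1, 4, 1, 5, 9], x) for x in XS]
noisy = list(codeword)
noisy[1] = (noisy[1] + 4) % Q
noisy[6] = (noisy[6] + 9) % Q
print(decode(noisy) == codeword)
```

The search over degrees $e=0,1,\dots,t$ implements the "lowest-degree solution" requirement directly; a production decoder would more likely use an algorithm such as Berlekamp-Massey, which is a different technique from the plain linear solve shown here.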